Decentralized Stochastic Control with Partial History Sharing: A Common Information Approach

Ashutosh Nayyar, Aditya Mahajan and Demosthenis Teneketzis 

Abstract 

A general model of decentralized stochastic control called partial history sharing information structure
is presented. In this model, at each step the controllers share part of their observation and control history
with each other. This general model subsumes several existing models of information sharing as special
cases. Based on the information commonly known to all the controllers, the decentralized problem is
reformulated as an equivalent centralized problem from the perspective of a coordinator. The coordinator
knows the common information and selects prescriptions that map each controller's local information to
its control actions. The optimal control problem at the coordinator is shown to be a partially observable 
Markov decision process (POMDP) which is solved using techniques from Markov decision theory. This 
approach provides (a) structural results for optimal strategies, and (b) a dynamic program for obtaining 
optimal strategies for all controllers in the original decentralized problem. Thus, this approach unifies the 
various ad-hoc approaches taken in the literature. In addition, the structural results on optimal control 
strategies obtained by the proposed approach cannot be obtained by the existing generic approach (the 
person-by-person approach) for obtaining structural results in decentralized problems; and the dynamic 
program obtained by the proposed approach is simpler than that obtained by the existing generic approach 
(the designer's approach) for obtaining dynamic programs in decentralized problems. 

Index Terms 

Decentralized Control, Stochastic Control, Information Structures, Markov Decision Theory, Team 
Theory 

I. Introduction 

Stochastic control theory provides analytic and computational techniques for centralized 
decision making in stochastic systems with noisy observations. For specific models such as 
Markov decision processes and linear quadratic and Gaussian systems, stochastic control gives
results that are intuitively appealing and computationally tractable. However, these results are
derived under the assumption that all decisions are made by a centralized decision maker who
sees all observations and perfectly recalls past observations and actions. This assumption of
a centralized decision maker is not true in a number of modern control applications such as
networked control systems, communication and queuing networks, sensor networks, and smart
grids. In such applications, decisions are made by multiple decision makers who have access
to different information. In this paper, we investigate such problems of decentralized stochastic
control.

A preliminary version of this paper appeared in the proceedings of the 46th Allerton conference on communication, control, and
computation, 2008 (see [1]).

The techniques from centralized stochastic control cannot be directly applied to decentralized 
control problems. Nonetheless, two general solution approaches that indirectly use techniques 
from centralized stochastic control have been used in the literature: (i) the person-by-person 
approach which takes the viewpoint of an individual decision maker (DM); and (ii) the designer's 
approach which takes the viewpoint of the collective team of DMs. 

The person-by-person approach investigates the decentralized control problem from the 
viewpoint of one DM, say DM i, and proceeds as follows: (i) arbitrarily fix the strategy of
all DMs except DM i; and (ii) use centralized stochastic control to derive structural properties
for the optimal best-response strategy of DM i. If such a structural property does not depend on
the choice of the strategy of other DMs, then it also holds for the globally optimal strategy of DM i.
By cyclically using this approach for all DMs, we can identify the structure of globally optimal 
strategies for all DMs. 

A variation of this approach may be used to identify person-by-person optimal strategies. The 
variation proceeds iteratively as follows. Start with an initial guess for the strategies of all DMs. 
At each iteration, select one DM (say DM i), and change its strategy to the best response strategy 
given the strategy of all other DMs. Repeat the process until a fixed point is reached, i.e., when 
no DM can improve performance by unilaterally changing its strategy. The resulting strategies 
are person-by-person optimal [2] and, in general, not globally optimal.
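The iteration described above can be sketched for a small static team. This is a minimal illustration, not the procedure of any cited work: the joint distribution `joint`, the cost function `cost`, and the helper names are all hypothetical, and each DM is assumed to see a single binary observation and choose a binary action.

```python
import itertools

def all_maps(domain, codomain):
    """Enumerate every function domain -> codomain as a dict."""
    for values in itertools.product(codomain, repeat=len(domain)):
        yield dict(zip(domain, values))

def expected_cost(joint, cost, g1, g2):
    """E[cost(x, g1(y1), g2(y2))] under the pmf {(x, y1, y2): prob}."""
    return sum(p * cost(x, g1[y1], g2[y2])
               for (x, y1, y2), p in joint.items())

def person_by_person(joint, cost, g1, g2, obs=(0, 1), acts=(0, 1), tol=1e-12):
    """Alternate best responses until no DM can improve unilaterally."""
    best = expected_cost(joint, cost, g1, g2)
    improved = True
    while improved:
        improved = False
        for cand in all_maps(obs, acts):      # best response of DM 1
            c = expected_cost(joint, cost, cand, g2)
            if c < best - tol:
                g1, best, improved = cand, c, True
        for cand in all_maps(obs, acts):      # best response of DM 2
            c = expected_cost(joint, cost, g1, cand)
            if c < best - tol:
                g2, best, improved = cand, c, True
    return g1, g2, best

# Hypothetical example: both DMs observe x exactly; the cost rewards
# agreeing on action 1, tolerates agreeing on 0, and penalizes mismatch.
joint = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}

def cost(x, u1, u2):
    if u1 != u2:
        return 2.0
    return 0.0 if u1 == 1 else 1.0

always0 = {0: 0, 1: 0}
always1 = {0: 1, 1: 1}
stuck = person_by_person(joint, cost, dict(always0), dict(always0))
good = person_by_person(joint, cost, dict(always1), dict(always1))
```

In this example the pair of "always 0" strategies is a fixed point with expected cost 1, because any unilateral deviation creates a mismatch, while the pair of "always 1" strategies achieves cost 0. The iteration therefore returns a person-by-person optimal pair that is not globally optimal when started from "always 0".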

In summary, the person-by-person approach identifies structural properties of globally optimal 
strategies and provides an iterative method to obtain person-by-person optimal strategies. This 
method has been successfully used to identify structural properties of globally optimal strategies 
for various applications including real-time communication [3]-[7], decentralized hypothesis
testing and quickest change detection [8]-[16], and networked control systems [17]-[19]. Under
certain conditions, the person-by-person optimal strategies found by this approach are globally
optimal [2], [20], [21].



The designer's approach, which is developed in [22], [23], investigates the decentralized
control problem from the viewpoint of the collective team of DMs or, equivalently, from the 
viewpoint of a system designer who knows the system model and probability distribution of the 
primitive random variables and chooses control strategies for all DMs. Effectively, the designer is 
solving a centralized planning problem. The designer's approach proceeds by: (i) modeling this 
centralized planning problem as a multi-stage, open-loop stochastic control problem in which 
the designer's decision at each time is the control law for that time for all DMs; and (ii) using 
centralized stochastic control to obtain a dynamic programming decomposition. Each step of 
the resulting dynamic program is a functional optimization problem (in contrast to centralized 
dynamic programming where each step is a parameter optimization problem). 

The designer's approach is often used in tandem with the person-by-person approach as follows.
First, the person-by-person approach is used to identify structural properties of globally optimal
strategies. Then, restricting attention to strategies with the identified structural property, the
designer's approach is used to obtain a dynamic programming decomposition for selecting the
globally optimal strategy. Such a tandem approach has been used in various applications including
real-time communication [B|, [24], [25], decentralized hypothesis testing [13], and networked
control systems [17], [18].



In addition to the above general approaches, other specialized approaches have been developed
to address specific problems in decentralized systems. Decentralized problems with partially
nested information structure were defined and studied in [26]. Decentralized linear quadratic
Gaussian (LQG) control problems with two controllers and partially nested information structure
were studied in [27], [28]. Partially nested decentralized LQG problems with controllers connected
via a graph were studied in [29], [30]. A generalization of partial nestedness called stochastic
nestedness was defined and studied in [31]. An important property of LQG control problems
with partially nested information structure is that there exists an affine control strategy which is
globally optimal. In general, the problem of finding the best affine control strategies may not
be a convex optimization problem. Conditions under which the problem of determining optimal
control strategies within the class of affine control strategies becomes a convex optimization
problem were identified in [32], [33].

Decentralized stochastic control problems with specific models of information sharing among
controllers have also been studied in the literature. Examples include systems with delayed
sharing information structures [34]-[36], systems with periodic sharing information structure [37],
control sharing information structure [38], [39], systems with broadcast information structure [19],
and systems with common and private observations [1].

In this paper, we present a new general model of decentralized stochastic control called partial 
history sharing information structure. In this model, we assume that: (i) controllers sequentially 
share part of their past data (past observations and control) with each other by means of a 
shared memory; and (ii) all controllers have perfect recall of commonly available data (common 
information). This model subsumes a large class of decentralized control models in which 
information is shared among the controllers. 

For this model, we present a general solution methodology that reformulates the original 
decentralized problem into an equivalent centralized problem from the perspective of a coordinator. 
The coordinator knows the common information and selects prescriptions that map each controller's 
local information to its control actions. The optimal control problem at the coordinator is shown 
to be a partially observable Markov decision process (POMDP) which is solved using techniques 
from Markov decision theory. This approach provides (a) structural results for optimal strategies, 
and (b) a dynamic program for obtaining optimal strategies for all controllers in the original 
decentralized problem. Thus, this approach unifies the various ad-hoc approaches taken in the 
literature.



A similar solution approach is used in [36] for a model that is a special case of the model
presented in this paper. We present an information state (Eq. (51)) for the model of [36] that is
simpler than that presented in [36, Theorem 2]. A preliminary version of the general solution
approach presented here was presented in [1] for a model that had features (e.g., direct but
noisy communication links between controllers) that are not necessary for partial history sharing.
However, it can be shown that by suitable redefinition of variables, the model in [1] can be recast
as an instance of the model in this paper and vice versa (see Appendix C). The information state
for partial history sharing that is presented in this paper (see Theorem 4) is simpler than that
presented in [1, Eq. (39)].

A. Common Information Approach for a Static Team Problem

We first illustrate how common information can be used in a static team problem with two
controllers. Let $X$ denote the state of nature and $Y^0, Y^1, Y^2$ be three correlated random variables
that depend on $X$. Assume that the joint distribution of $(X, Y^0, Y^1, Y^2)$ is given.

Controller $i$, $i = 1,2$, observes $(Y^0, Y^i)$ and chooses a control action $U^i = g^i(Y^0, Y^i)$. The
system incurs a cost $\ell(X, U^1, U^2)$. The control objective is to choose $(g^1, g^2)$ to minimize

$$J(g^1, g^2) := \mathbb{E}^{(g^1, g^2)}[\ell(X, U^1, U^2)].$$

If all the system variables are finite valued, we can solve the above optimization problem by
a brute force search over all control strategies $(g^1, g^2)$. For example, if all variables are binary
valued, each $g^i$ maps the four values of $(Y^0, Y^i)$ to a binary action, so there are $2^4$ choices per
controller. We therefore need to compute the performance of $2^4 \times 2^4 = 256$ control strategies and
choose the one with the best performance.

In this example, both controllers have a common observation $Y^0$. One of the main ideas of this
paper is to use such common information among the controllers to simplify the search process as
follows. Instead of specifying the control strategies $(g^1, g^2)$ directly, we consider a coordinated
system in which a coordinator observes the common information $Y^0$ and chooses prescriptions
$(\Gamma^1, \Gamma^2)$, where $\Gamma^i$ is a mapping from $\mathcal{Y}^i$ to $\mathcal{U}^i$, $i = 1,2$. Hence, $(\Gamma^1, \Gamma^2) = d(Y^0)$, where $d$ is
called the coordination strategy. The coordinator then communicates these prescriptions to the
controllers who simply use them to choose $U^i = \Gamma^i(Y^i)$, $i = 1,2$.

It is easy to verify (see Proposition 3 for a formal proof) that choosing the control strategies
$(g^1, g^2)$ in the original system is equivalent to choosing a coordination strategy $d$ in the coordinated
system. The problem of choosing the best coordination strategy, however, is a centralized problem
in which the coordinator is the only decision-maker.

For example, consider the case when all system variables are binary valued. For any coordination
strategy $d$, let $(\gamma^1_0, \gamma^2_0) = d(0)$ and $(\gamma^1_1, \gamma^2_1) = d(1)$. Then, the cost associated with this coordination
strategy is given as:

$$J(d) := \mathbb{E}^{d}[\ell(X, U^1, U^2)] = \mathbb{P}(Y^0 = 0)\, \mathbb{E}[\ell(X, \gamma^1_0(Y^1), \gamma^2_0(Y^2)) \mid Y^0 = 0] + \mathbb{P}(Y^0 = 1)\, \mathbb{E}[\ell(X, \gamma^1_1(Y^1), \gamma^2_1(Y^2)) \mid Y^0 = 1].$$

To minimize the above cost, we can minimize the two terms separately. Therefore, to find the best
coordination strategy $d$, we can search for optimal prescriptions for the cases $Y^0 = 0$ and $Y^0 = 1$
separately. Searching for the best prescriptions for each of these cases involves computing the
performance of $2^2 \times 2^2 = 16$ prescription pairs and choosing the one with the best performance.
Thus, to find the best coordination strategy, we need to evaluate the performance of $16 + 16 = 32$
prescription pairs. Contrast this with the 256 strategies whose costs we need to evaluate to solve
the original problem by brute force.
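The counting argument above can be checked numerically. The sketch below is illustrative only: the joint distribution is randomly generated, the cost function is a hypothetical choice, and the function names are not from the paper. It runs the brute-force search over control strategy pairs and the coordinator's per-realization search over prescription pairs, counting evaluations in each.

```python
import itertools
import random

BIN = (0, 1)

def all_maps(domain, codomain=BIN):
    """List every function domain -> codomain as a dict."""
    domain = list(domain)
    return [dict(zip(domain, v))
            for v in itertools.product(codomain, repeat=len(domain))]

def random_joint(seed=0):
    """Random pmf over binary (x, y0, y1, y2); illustrative data only."""
    rng = random.Random(seed)
    keys = list(itertools.product(BIN, repeat=4))
    w = [rng.random() for _ in keys]
    total = sum(w)
    return {k: wi / total for k, wi in zip(keys, w)}

def cost(x, u1, u2):
    """Hypothetical cost: wrong guesses of x cost 1, disagreement 0.5."""
    return (u1 != x) + (u2 != x) + 0.5 * (u1 != u2)

def brute_force(joint):
    """Search all (g1, g2) with g_i: (y0, y_i) -> u_i."""
    laws = all_maps(itertools.product(BIN, BIN))   # 2^4 laws per controller
    best, evals = float("inf"), 0
    for g1, g2 in itertools.product(laws, laws):
        evals += 1
        j = sum(p * cost(x, g1[(y0, y1)], g2[(y0, y2)])
                for (x, y0, y1, y2), p in joint.items())
        best = min(best, j)
    return best, evals

def coordinator(joint):
    """For each y0, search prescription pairs gamma_i: y_i -> u_i."""
    presc = all_maps(BIN)                          # 2^2 per controller
    total, evals = 0.0, 0
    for y0 in BIN:
        best = float("inf")
        for gam1, gam2 in itertools.product(presc, presc):
            evals += 1
            # Term P(Y0 = y0) * E[cost | Y0 = y0] of the decomposition.
            j = sum(p * cost(x, gam1[y1], gam2[y2])
                    for (x, c, y1, y2), p in joint.items() if c == y0)
            best = min(best, j)
        total += best
    return total, evals

joint = random_joint(seed=1)
bf_cost, bf_evals = brute_force(joint)
co_cost, co_evals = coordinator(joint)
```

Both searches return the same optimal cost (up to floating-point rounding), with 256 evaluations for the brute-force search and 32 for the coordinator's search, because the cost separates across the realizations of the common observation.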

The above example describes a static system and illustrates that common information can
be exploited to convert the decentralized optimization problem into a centralized optimization
problem involving a coordinator. In this paper, we build upon this basic idea and present a solution
approach based on common information that works for dynamical decentralized systems as well.
Our approach converts the decentralized problem into a centralized stochastic control problem (in
particular, a partially observable Markov decision process), identifies the structure of optimal control
strategies, and provides a dynamic-program-like decomposition for the decentralized problem.



B. Contributions of the Paper 

We introduce a general model of decentralized stochastic control problem in which multiple
controllers share part of their information with each other. We call this model the partial history
sharing information structure. This model subsumes several existing models of information
sharing in decentralized stochastic control as special cases (see Section II-B). We establish two
results for our model. Firstly, we establish a structural property of optimal control strategies.
Secondly, we provide a dynamic programming decomposition of the problem of finding optimal
control strategies. As in [1], [36], our results are derived using a common information based
approach (see Section III). This approach differs from the person-by-person approach and the
designer's approach mentioned earlier. In particular, the structural properties found in this paper
cannot be found by the person-by-person approach described earlier. Moreover, the dynamic
programming decomposition found in this paper is distinct from, and simpler than, the
dynamic programming decomposition based on the designer's approach. For a general framework
for using common information in sequential decision making problems, see [40].



C. Notation 

Random variables are denoted by upper case letters; their realization by the corresponding lower
case letter. For integers $a \le b$ and $c \le d$, $X_{a:b}$ is a short hand for the vector $(X_a, X_{a+1}, \dots, X_b)$
while $X^{c:d}$ is a short hand for the vector $(X^c, X^{c+1}, \dots, X^d)$. When $a > b$, $X_{a:b}$ equals the
empty set. The combined notation $X^{c:d}_{a:b}$ is a short hand for the vector $(X^j_i : i = a, a+1, \dots, b,$
$j = c, c+1, \dots, d)$. In general, subscripts are used as time index while superscripts are used
to index controllers. Bold letters $\mathbf{X}$ are used as a short hand for the vector $(X^{1:n})$. $\mathbb{P}(\cdot)$ is the
probability of an event, $\mathbb{E}(\cdot)$ is the expectation of a random variable. For a collection of functions
$g$, we use $\mathbb{P}^g(\cdot)$ and $\mathbb{E}^g(\cdot)$ to denote that the probability measure/expectation depends on the
choice of functions in $g$. $\mathbb{1}_A(\cdot)$ is the indicator function of a set $A$. For singleton sets $\{a\}$, we
also denote $\mathbb{1}_{\{a\}}(\cdot)$ by $\mathbb{1}_a(\cdot)$.

For a singleton $a$ and a set $B$, $\{a, B\}$ denotes the set $\{a\} \cup B$. For two sets $A$ and $B$, $\{A, B\}$
denotes the set $A \cup B$. For two finite sets $A, B$, $F(A, B)$ is the set of all functions from $A$ to
$B$. Also, if $A = \emptyset$, $F(A, B) := B$. For a finite set $A$, $\Delta(A)$ is the set of all probability mass
functions over $A$. For the ease of exposition, we assume that all state, observation and control
variables take values in finite sets.

For two random variables (or random vectors) $X$ and $Y$ taking values in $\mathcal{X}$ and $\mathcal{Y}$, $\mathbb{P}(X = x \mid Y)$
denotes the conditional probability of the event $\{X = x\}$ given $Y$ and $\mathbb{P}(X \mid Y)$ denotes the
conditional PMF (probability mass function) of $X$ given $Y$, that is, it denotes the collection of
conditional probabilities $\mathbb{P}(X = x \mid Y)$, $x \in \mathcal{X}$. Finally, all equalities involving random variables
are to be interpreted as almost sure equalities (that is, they hold with probability one).



D. Organization 

The rest of this paper is organized as follows. We present our model of a decentralized
stochastic control problem in Section II. We also present several special cases of our model in
this section. We prove our main results in Section III. We apply our result to some special cases
in Section III-B. We present a simplification of our result and a generalization of our model in
Section IV. We consider the infinite time-horizon discounted cost analogue of our problem in
Section V. Finally, we conclude in Section VI.



II. Problem Formulation 

A. Basic model: Partial History Sharing Information Structure 

1) The Dynamic System: Consider a dynamic system with $n$ controllers. The system operates
in discrete time for a horizon $T$. Let $X_t \in \mathcal{X}_t$ denote the state of the system at time $t$, $U^i_t \in \mathcal{U}^i_t$
denote the control action of controller $i$, $i = 1, \dots, n$ at time $t$, and $\mathbf{U}_t$ denote the vector
$(U^1_t, \dots, U^n_t)$.

The initial state $X_1$ has a probability distribution $Q_1$ and evolves according to

$$X_{t+1} = f_t(X_t, \mathbf{U}_t, W^0_t), \tag{1}$$

where $\{W^0_t\}_{t=1}^{T}$ is a sequence of i.i.d. random variables with probability distribution $Q^0_W$.

2) Data available at the controller: At any time t, each controller has access to three types 
of data: current observation, local memory, and shared memory. 

(i) Current local observation: Each controller makes a local observation $Y^i_t \in \mathcal{Y}^i_t$ on the state
of the system at time $t$,

$$Y^i_t = h^i_t(X_t, W^i_t), \tag{2}$$

where $\{W^i_t\}_{t=1}^{T}$ is a sequence of i.i.d. random variables with probability distribution $Q^i_W$. We
assume that the random variables in the collection $\{X_1, W^j_t,\ t = 1, \dots, T,\ j = 0, 1, \dots, n\}$,
called primitive random variables, are mutually independent.

(ii) Local memory: Each controller stores a subset $M^i_t$ of its past local observations and its
past actions in a local memory:

$$M^i_t \subset \{Y^i_{1:t-1}, U^i_{1:t-1}\}. \tag{3}$$

At $t = 1$, the local memory is empty, $M^i_1 = \emptyset$.

(iii) Shared memory: In addition to its local memory, each controller has access to a shared
memory. The contents $C_t$ of the shared memory at time $t$ are a subset of the past local
observations and control actions of all controllers:

$$C_t \subset \{\mathbf{Y}_{1:t-1}, \mathbf{U}_{1:t-1}\}, \tag{4}$$

where $\mathbf{Y}_t$ and $\mathbf{U}_t$ denote the vectors $(Y^1_t, \dots, Y^n_t)$ and $(U^1_t, \dots, U^n_t)$ respectively. At $t = 1$,
the shared memory is empty, $C_1 = \emptyset$.

Controller $i$ chooses action $U^i_t$ as a function of the total data $(Y^i_t, M^i_t, C_t)$ available to it.
Specifically, for every controller $i$, $i = 1, \dots, n$,

$$U^i_t = g^i_t(Y^i_t, M^i_t, C_t), \tag{5}$$

where $g^i_t$ is called the control law of controller $i$. The collection $g^i = (g^i_1, \dots, g^i_T)$ is called the
control strategy of controller $i$. The collection $g^{1:n} = (g^1, \dots, g^n)$ is called the control strategy
of the system.
3) Update of local and shared memories:

(i) Shared memory update: After taking the control action at time $t$, the local information
at controller $i$ consists of the contents $M^i_t$ of its local memory, its local observation $Y^i_t$
and its control action $U^i_t$. Controller $i$ sends a subset $Z^i_t$ of this local information
$\{M^i_t, Y^i_t, U^i_t\}$ to the shared memory. The subset $Z^i_t$ is chosen according to a pre-specified
protocol. The contents of shared memory are nested in time, that is, the contents $C_{t+1}$ of
the shared memory at time $t + 1$ are the contents $C_t$ at time $t$ augmented with the new
data $\mathbf{Z}_t = (Z^1_t, Z^2_t, \dots, Z^n_t)$ sent by all the controllers at time $t$:

$$C_{t+1} = \{C_t, \mathbf{Z}_t\}. \tag{6}$$

(ii) Local memory update: After taking the control action and sending data to the shared memory
at time $t$, controller $i$ updates its local memory according to a pre-specified protocol. The
content $M^i_{t+1}$ of the local memory can at most equal the total local information $\{M^i_t, Y^i_t, U^i_t\}$
at the controller. However, to ensure that the local and shared memories at time $t + 1$ don't
overlap, we assume that

$$M^i_{t+1} \subset \{M^i_t, Y^i_t, U^i_t\} \setminus Z^i_t. \tag{7}$$
Figure 1 shows the time order of observations, actions and memory updates. We refer to the
above model as the partial history sharing information structure.

[Fig. 1. Time ordering of Observations, Actions and Memory Updates]

4) The optimization problem: At time $t$, the system incurs a cost $\ell(X_t, \mathbf{U}_t)$. The performance
of the control strategy of the system is measured by the expected total cost

$$J(g^{1:n}) := \mathbb{E}^{g^{1:n}}\Big[\sum_{t=1}^{T} \ell(X_t, \mathbf{U}_t)\Big], \tag{8}$$

where the expectation is with respect to the joint probability measure on $(X_{1:T}, \mathbf{U}_{1:T})$ induced
by the choice of $g^{1:n}$.

We are interested in the following optimization problem. 

Problem 1 For the model described above, given the state evolution functions $f_t$, the observation
functions $h^i_t$, the protocols for updating local and shared memory, the cost function $\ell$, the
distributions $Q_1$, $Q^i_W$, $i = 0, 1, \dots, n$, and the horizon $T$, find a control strategy $g^{1:n}$ for
the system that minimizes the expected total cost given by (8).



B. Special Cases: The Models 

In the above model, although we have not specified the exact protocols by which controllers 
update the local and shared memories, we assume that pre-specified protocols are being used. 
Different choices of this protocol result in different information structures for the system. In 
this section, we describe several models of decentralized control systems that can be viewed 
as special cases of our model by assuming a particular choice of protocol for local and shared 
memory updates. 

1) Delayed Sharing Information Structure: Consider the following special case of the model
of Section II-A.

(i) The shared memory at the beginning of time $t$ is $C_t = \{\mathbf{Y}_{1:t-s}, \mathbf{U}_{1:t-s}\}$, where $s \ge 1$ is a
fixed number. The local memory at the beginning of time $t$ is $M^i_t = \{Y^i_{t-s+1:t-1}, U^i_{t-s+1:t-1}\}$.

(ii) At each time $t$, after taking the action $U^i_t$, controller $i$ sends $Z^i_t = \{Y^i_{t-s+1}, U^i_{t-s+1}\}$ to the
shared memory and the shared memory at $t + 1$ becomes $C_{t+1} = \{\mathbf{Y}_{1:t-s+1}, \mathbf{U}_{1:t-s+1}\}$.

(iii) After sending $Z^i_t = \{Y^i_{t-s+1}, U^i_{t-s+1}\}$ to the shared memory, controller $i$ updates the local
memory to $M^i_{t+1} = \{Y^i_{t-s+2:t}, U^i_{t-s+2:t}\}$.

In this special case, the observations and control actions of each controller are shared with
every other controller after a delay of $s$ time steps. Hence, the above special case corresponds to
the delayed sharing information structure considered in [34], [36], [41].
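The delayed sharing protocol can be checked against the general memory updates (6) and (7) with a small symbolic simulation. This is a sketch under stated assumptions: data items are represented by hypothetical labels `("Y", i, t)` and `("U", i, t)`, no system dynamics are simulated, the local memory update (7) is taken with equality, and nothing is sent while `t - s + 1 < 1`.

```python
def simulate_delayed_sharing(n, T, s):
    """Run memory updates (6)-(7) under the delayed-sharing protocol.

    Returns {t: (C_t, [M_t^1, ..., M_t^n])} for t = 1, ..., T + 1,
    where every data item is a label (kind, controller, time).
    """
    shared = set()
    local = [set() for _ in range(n)]
    history = {1: (set(shared), [set(m) for m in local])}
    for t in range(1, T + 1):
        sent = set()
        for i in range(n):
            # Local information {M_t^i, Y_t^i, U_t^i} after acting.
            info = local[i] | {("Y", i, t), ("U", i, t)}
            # Z_t^i: the items with time index t - s + 1.
            z = {item for item in info if item[2] == t - s + 1}
            local[i] = info - z          # eq. (7), taken with equality
            sent |= z
        shared = shared | sent           # eq. (6)
        history[t + 1] = (set(shared), [set(m) for m in local])
    return history

def closed_form(n, t, s):
    """C_t and M_t^i as stated in items (i)-(iii) of the special case."""
    c = {(k, i, r) for k in "YU" for i in range(n)
         for r in range(1, t - s + 1)}
    m = [{(k, i, r) for k in "YU" for r in range(max(1, t - s + 1), t)}
         for i in range(n)]
    return c, m

hist = simulate_delayed_sharing(n=2, T=6, s=2)
```

For every time in the simulated horizon, the shared and local memories produced by the generic updates coincide with the closed-form expressions of the special case, which is what it means for the delayed sharing structure to be an instance of the basic model.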

2) Delayed State Sharing Information Structure: A special case of the delayed sharing
information structure (which itself is a special case of our basic model) is the delayed state
sharing information structure [35]. This information structure can be obtained from the delayed
sharing information structure by making the following assumptions:

(i) The state of the system at time $t$ is an $n$-dimensional vector $X_t = (X^1_t, X^2_t, \dots, X^n_t)$.

(ii) At each time $t$, the current local observation of controller $i$ is $Y^i_t = X^i_t$, for $i = 1, 2, \dots, n$.

In this special case, the complete state vector $X_t$ is available to all controllers after a delay of $s$
time steps.

3) Periodic Sharing Information Structure: Consider the following special case of the model
of Section II-A where controllers update the shared memory periodically with period $s \ge 1$:

(i) For time $ks < t \le (k+1)s$, where $k = 0, 1, 2, \dots$, the shared memory at the beginning
of time $t$ is $C_t = \{\mathbf{Y}_{1:ks}, \mathbf{U}_{1:ks}\}$. The local memory at the beginning of time $t$ is
$M^i_t = \{Y^i_{ks+1:t-1}, U^i_{ks+1:t-1}\}$.

(ii) At each time $t = (k+1)s$, $k = 0, 1, 2, \dots$, after taking the action $U^i_t$, controller $i$ sends
$Z^i_t = \{Y^i_{ks+1:(k+1)s}, U^i_{ks+1:(k+1)s}\}$ to the shared memory. At other times, each controller
does not send anything (thus $Z^i_t = \emptyset$).

(iii) After sending $Z^i_t$ to the shared memory, controller $i$ updates the local memory to
$M^i_{t+1} = \{M^i_t, Y^i_t, U^i_t\} \setminus Z^i_t$.

In this special case, the entire history of observations and control actions is shared periodically
between controllers with period $s$. Hence, the above special case corresponds to the periodic
sharing information structure considered in [37].

4) Control Sharing Information Structure: Consider the following special case of the model
of Section II-A.

(i) The shared memory at the beginning of time $t$ is $C_t = \{\mathbf{U}_{1:t-1}\}$. The local memory at the
beginning of time $t$ is $M^i_t = \{Y^i_{1:t-1}\}$.

(ii) At each time $t$, after taking the action $U^i_t$, controller $i$ sends $Z^i_t = \{U^i_t\}$ to the shared
memory.

(iii) After sending $Z^i_t = \{U^i_t\}$ to the shared memory, controller $i$ updates the local memory to
$M^i_{t+1} = \{Y^i_{1:t}\}$.

In this special case, the control actions of each controller are shared with every other controller
after a delay of 1 time step. Hence, the above special case corresponds to the control sharing
information structure considered in [38].

5) No Shared Memory with or without finite local memory: Consider the following special
case of the model of Section II-A.

(i) The shared memory at each time is empty, $C_t = \emptyset$, and the local memory at the beginning
of time $t$ is $M^i_t = \{Y^i_{t-s:t-1}, U^i_{t-s:t-1}\}$, where $s \ge 1$ is a fixed number.

(ii) Controllers do not send any data to shared memory, $Z^i_t = \emptyset$.

(iii) At the end of time $t$, controllers update their local memories to $M^i_{t+1} = \{Y^i_{t-s+1:t}, U^i_{t-s+1:t}\}$.

In this special case, the controllers don't share any data. The above model is related to the
finite-memory controller model of [42]. A related special case is the situation where the local
memory at each controller consists of all of its past local observations and its past actions, that
is, $M^i_t = \{Y^i_{1:t-1}, U^i_{1:t-1}\}$.

Remark 1 All the special cases considered above are examples of symmetric sharing. That is,
different controllers update their local memories according to identical protocols and the data sent
by a controller to the shared memory is selected according to identical protocols. However, this
symmetry is not required for our model. Consider, for example, the delayed sharing information
structure where at the end of time $t$, controller $i$ sends $Y^i_{t-s_i}, U^i_{t-s_i}$ to the shared memory, with
$s_i$, $i = 1, 2, \dots, n$, being fixed, but not necessarily identical, numbers. This kind of asymmetric
sharing is also a special case of our model. □



III. Main Results

For centralized systems, stochastic control theory provides two important analytical results. 
Firstly, it provides a structural result. This result states that there is an optimal control strategy 
which selects control actions as a function only of the controller's posterior belief on the state of 
the system conditioned on all its observations and actions till the current time. The controller's 
posterior belief is called its information state. Secondly, stochastic control theory provides a 
dynamic programming decomposition of the problem of finding optimal control strategies in 
centralized systems. This dynamic programming decomposition allows one to evaluate the optimal 
action for each realization of the controller's information state in a backward inductive manner. 

In this section, we provide a structural result and a dynamic programming decomposition for
the decentralized stochastic control problem with partial information sharing formulated above
(Problem 1). The main idea of the proof is to formulate an equivalent centralized stochastic
control problem; solve the equivalent problem using classical stochastic-control techniques; and
translate the results back to the basic model. To that end, we proceed as follows:

1) Formulate a centralized coordinated system from the point of view of a coordinator
that observes only the common information among the controllers in the basic model,
i.e., the coordinator observes the shared memory $C_t$ but not the local memories $(M^i_t,$
$i = 1, \dots, n)$ or local observations $(Y^i_t,\ i = 1, \dots, n)$. The coordinator chooses prescriptions
$\Gamma_t = (\Gamma^1_t, \dots, \Gamma^n_t)$, where $\Gamma^i_t$ is a mapping from $(Y^i_t, M^i_t)$ to $U^i_t$, $i = 1, \dots, n$.

2) Show that the coordinated system is a POMDP (partially observable Markov decision 
process). 

3) For the coordinated system, determine the structure of an optimal coordination strategy 
and a dynamic program to find an optimal coordination strategy. 

4) Show that any strategy of the coordinated system is implementable in the basic model with 
the same value of the total expected cost. Conversely, any strategy of the basic model is 
implementable in the coordinated system with the same value of the total expected cost. 
Hence, the two systems are equivalent. 

5) Translate the structural results and dynamic programming decomposition of the coordinated 
system (obtained in stage 3) to the basic model. 

Stage 1: The coordinated system 

Consider a coordinated system that consists of a coordinator and $n$ passive controllers. The
coordinator knows the shared memory $C_t$ at time $t$, but not the local memories $(M^i_t,\ i = 1, \dots, n)$
or local observations $(Y^i_t,\ i = 1, \dots, n)$. At each time $t$, the coordinator chooses mappings
$\Gamma^i_t : \mathcal{Y}^i_t \times \mathcal{M}^i_t \to \mathcal{U}^i_t$, $i = 1, 2, \dots, n$, according to

$$\Gamma_t = d_t(C_t, \Gamma_{1:t-1}), \tag{9}$$

where $\Gamma_t = (\Gamma^1_t, \Gamma^2_t, \dots, \Gamma^n_t)$. The function $d_t$ is called the coordination rule at time $t$ and the
collection of functions $d := (d_1, \dots, d_T)$ is called the coordination strategy. The selected $\Gamma^i_t$ is
communicated to controller $i$ at time $t$.

The function $\Gamma^i_t$ tells controller $i$ how to process its current local observation and its local
memory at time $t$; for that reason, we call $\Gamma^i_t$ the coordinator's prescription to controller $i$.
Controller $i$ generates an action using its prescription as follows:

$$U^i_t = \Gamma^i_t(Y^i_t, M^i_t). \tag{10}$$
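The relation between a control law and a prescription can be illustrated with a short sketch. Everything here is hypothetical: the control law `g`, the binary alphabets, and the simplification that the coordination rule depends only on the shared memory `c` (rather than also on past prescriptions) are assumptions made for illustration.

```python
from itertools import product

def prescription_from_law(g, c, local_values):
    """Tabulate Gamma = g(., ., c) over every local pair (y, m)."""
    return {(y, m): g(y, m, c) for (y, m) in local_values}

def law_from_rule(d):
    """Turn a coordination rule d: c -> prescription into a control law."""
    return lambda y, m, c: d(c)[(y, m)]

# Hypothetical control law for one controller: binary observation y,
# binary local memory m, and shared memory c given as a tuple of bits.
def g(y, m, c):
    return (y + m + sum(c)) % 2

LOCAL = list(product((0, 1), repeat=2))   # all (y, m) pairs

def d(c):
    """Coordination rule induced by g; it needs only the shared memory."""
    return prescription_from_law(g, c, LOCAL)

g_back = law_from_rule(d)                 # control law recovered from d
```

The coordinator never sees `y` or `m`; it only fixes, for the observed `c`, the whole table from local data to actions. Applying the prescription to the realized local data reproduces the action of the original control law, which is the intuition behind the equivalence of the two systems.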
For this coordinated system, the system dynamics, the observation model and the cost are the
same as the basic model of Section II-A: the system dynamics are given by (1), each controller's
current observation is given by (2) and the instantaneous cost at time $t$ is $\ell(X_t, \mathbf{U}_t)$. As before,
the performance of a coordination strategy is measured by the expected total cost

$$J(d) = \mathbb{E}^{d}\Big[\sum_{t=1}^{T} \ell(X_t, \mathbf{U}_t)\Big], \tag{11}$$

where the expectation is with respect to a joint measure on $(X_{1:T}, \mathbf{U}_{1:T})$ induced by the choice
of $d$.

In this coordinated system, we are interested in the following optimization problem:

Problem 2 For the model of the coordinated system described above, find a coordination strategy
$d$ that minimizes the total expected cost given by (11).



Stage 2: The coordinated system as a POMDP 

We will now show that the coordinated system is a partially observed Markov decision process.
To that end, we first describe the model of POMDPs [43].

POMDP Model: A partially observable Markov decision process consists of a state process
$S_t \in \mathcal{S}$, an observation process $O_t \in \mathcal{O}$, an action process $A_t \in \mathcal{A}$, $t = 1, 2, \dots, T$, and a single
decision-maker where

1) The action at time $t$ is chosen by the decision-maker as a function of observation and
action history, that is,

$$A_t = d_t(O_{1:t}, A_{1:t-1}), \tag{12}$$

where $d_t$ is the decision rule at time $t$.

2) After the action at time $t$ is taken, the new state and new observation are generated according
to the transition probability rule

$$\mathbb{P}(S_{t+1}, O_{t+1} \mid S_{1:t}, O_{1:t}, A_{1:t}) = \mathbb{P}(S_{t+1}, O_{t+1} \mid S_t, A_t). \tag{13}$$

3) At each time, an instantaneous cost $\ell(S_t, A_t)$ is incurred.

4) The optimization problem for the decision-maker is to choose a decision strategy $d :=$
$(d_1, \dots, d_T)$ to minimize a total cost given as

$$\mathbb{E}\Big[\sum_{t=1}^{T} \ell(S_t, A_t)\Big]. \tag{14}$$
The following well-known result provides the structure of optimal strategies and a dynamic program for POMDPs. For details, see [43].

Theorem 1 (POMDP Result) Let $\Theta_t$ be the conditional probability distribution of the state $S_t$ at time $t$ given the observations $O_{1:t}$ and actions $A_{1:t-1}$,

$$\Theta_t(s) = \mathbb{P}(S_t = s \mid O_{1:t}, A_{1:t-1}), \quad s \in \mathcal{S}.$$

Then,

1) $\Theta_{t+1} = \eta_t(\Theta_t, A_t, O_{t+1})$, where $\eta_t$ is the standard non-linear filter: If $\theta_t, a_t, o_{t+1}$ are the realizations of $\Theta_t$, $A_t$ and $O_{t+1}$, then the realization of the $s$-th element of the vector $\Theta_{t+1}$ is

$$\theta_{t+1}(s) = \frac{\sum_{s'} \theta_t(s')\, \mathbb{P}(S_{t+1} = s, O_{t+1} = o_{t+1} \mid S_t = s', A_t = a_t)}{\sum_{\hat{s}, s'} \theta_t(s')\, \mathbb{P}(S_{t+1} = \hat{s}, O_{t+1} = o_{t+1} \mid S_t = s', A_t = a_t)} =: \eta_t^s(\theta_t, a_t, o_{t+1}) \tag{15}$$

and $\eta_t(\theta_t, a_t, o_{t+1})$ is the vector $(\eta_t^s(\theta_t, a_t, o_{t+1}))_{s \in \mathcal{S}}$.

2) There exists an optimal decision strategy of the form

$$A_t = d_t(\Theta_t).$$

Further, such a strategy can be found by the following dynamic program:

$$V_T(\theta) = \inf_{a} \mathbb{E}\{\tilde{\ell}(S_T, a) \mid \Theta_T = \theta\}, \tag{16}$$

and for $1 \le t \le T - 1$,

$$V_t(\theta) = \inf_{a} \mathbb{E}\{\tilde{\ell}(S_t, a) + V_{t+1}(\eta_t(\theta, a, O_{t+1})) \mid \Theta_t = \theta, A_t = a\}. \tag{17}$$
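The filter in (15) can be illustrated for finite state and observation spaces. The following is a minimal sketch, not taken from the paper: the function name `belief_update`, the dictionary layout of `kernel`, and the two-state example are assumptions made for illustration only.

```python
from typing import Dict, Hashable, Tuple

Belief = Dict[Hashable, float]
# Assumed layout: kernel[(s, a)] maps (s_next, o_next) to
# P(S_{t+1} = s_next, O_{t+1} = o_next | S_t = s, A_t = a).
Kernel = Dict[Tuple[Hashable, Hashable], Dict[Tuple[Hashable, Hashable], float]]


def belief_update(theta: Belief, a: Hashable, o_next: Hashable, kernel: Kernel) -> Belief:
    """One step of the filter in (15): theta_{t+1} from (theta_t, a_t, o_{t+1})."""
    unnormalized: Belief = {}
    for s_prev, p in theta.items():
        for (s_next, o), q in kernel[(s_prev, a)].items():
            if o == o_next:
                # Numerator of (15), accumulated over the previous state s'.
                unnormalized[s_next] = unnormalized.get(s_next, 0.0) + p * q
    total = sum(unnormalized.values())  # denominator of (15)
    if total == 0.0:
        raise ValueError("observation has zero probability under theta and a")
    return {s: v / total for s, v in unnormalized.items()}


# Hypothetical two-state example: the state never changes and the
# observation reports it correctly with probability 0.8.
kernel = {
    (0, "a"): {(0, 0): 0.8, (0, 1): 0.2},
    (1, "a"): {(1, 1): 0.8, (1, 0): 0.2},
}
theta_next = belief_update({0: 0.5, 1: 0.5}, "a", 1, kernel)
```

Starting from a uniform belief, observing 1 shifts most of the probability mass to state 1, as expected from Bayes' rule.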

We will now show that the coordinated system can be viewed as an instance of the above POMDP model by defining the state process as $S_t := \{X_t, \mathbf{Y}_t, \mathbf{M}_t\}$, the observation process as $O_t := \mathbf{Z}_{t-1}$, and the action process as $A_t := \Gamma_t$.

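Lemma 1 For the coordinated system of Problem 2:

1) There exist functions $\tilde{f}_t$ and $\tilde{h}_t$, $t = 1, \dots, T$, such that

$$S_{t+1} = \tilde{f}_t(S_t, \Gamma_t, W_t^0, \mathbf{W}_{t+1}), \tag{18}$$

and

$$\mathbf{Z}_t = \tilde{h}_t(S_t, \Gamma_t). \tag{19}$$

In particular, we have that

$$\mathbb{P}(S_{t+1}, \mathbf{Z}_t \mid S_{1:t}, \mathbf{Z}_{1:t-1}, \Gamma_{1:t}) = \mathbb{P}(S_{t+1}, \mathbf{Z}_t \mid S_t, \Gamma_t). \tag{20}$$

2) Furthermore, there exists a function $\tilde{\ell}$ such that

$$\ell(X_t, \mathbf{U}_t) = \tilde{\ell}(S_t, \Gamma_t). \tag{21}$$

Thus, the objective of minimizing (11) is the same as minimizing

$$J(d) = \mathbb{E}\Big[\sum_{t=1}^{T} \tilde{\ell}(S_t, \Gamma_t)\Big]. \tag{22}$$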



Proof: The existence of $\tilde{f}_t$ follows from (1), (2), (10), (7) and the definition of $S_t$. The existence of $\tilde{h}_t$ follows from the fact that $Z_t^i$ is a fixed subset of $\{M_t^i, Y_t^i, U_t^i\}$, equation (10) and the definition of $S_t$. Equation (20) follows from (18) and the independence of $W_t^0, \mathbf{W}_{t+1}$ from all random variables in the conditioning on the left hand side of (20). The existence of $\tilde{\ell}$ follows from the definition of $S_t$ and (10). ■
Recall that the coordinator is choosing its actions according to a coordination strategy of the form

$$\Gamma_t = d_t(C_t, \Gamma_{1:t-1}) = d_t(\mathbf{Z}_{1:t-1}, \Gamma_{1:t-1}). \tag{23}$$

Equation (23) and Lemma 1 imply that the coordinated system is an instance of the POMDP model described above.



Stage 3: Structural result and dynamic program for the coordinated system 

Since the coordinated system is a POMDP, Theorem 1 gives the structure of the optimal coordination strategies. To that end, define the coordinator's information state

$$\Pi_t := \mathbb{P}(S_t \mid \mathbf{Z}_{1:t-1}, \Gamma_{1:t-1}) = \mathbb{P}(S_t \mid C_t, \Gamma_{1:t-1}). \tag{24}$$

Then, we have the following:

Proposition 1 For Problem 2, there is no loss of optimality in restricting attention to coordination rules of the form

$$\Gamma_t = d_t(\Pi_t). \tag{25}$$
Furthermore, an optimal coordination strategy of the above form can be found using a dynamic program. To that end, observe that we can write

$$\Pi_{t+1} = \eta_t(\Pi_t, \Gamma_t, \mathbf{Z}_t), \tag{26}$$

where $\eta_t$ is the standard non-linear filtering update function (see Appendix A). We denote by $\mathcal{B}_t$ the space of possible realizations of $\Pi_t$. Thus,

$$\mathcal{B}_t := \Delta(\mathcal{X}_t \times \mathcal{Y}_t^1 \times \mathcal{M}_t^1 \times \dots \times \mathcal{Y}_t^n \times \mathcal{M}_t^n). \tag{27}$$

Recall that $F(\mathcal{Y}_t^i \times \mathcal{M}_t^i, \mathcal{U}_t^i)$ is the set of all functions from $\mathcal{Y}_t^i \times \mathcal{M}_t^i$ to $\mathcal{U}_t^i$ (see Section I-C). Then, we have the following result.

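Proposition 2 For all $\pi$ in $\mathcal{B}_t$, define

$$V_T(\pi) = \inf_{\{\tilde{\gamma}_T^i \in F(\mathcal{Y}_T^i \times \mathcal{M}_T^i,\, \mathcal{U}_T^i),\ 1 \le i \le n\}} \mathbb{E}[\tilde{\ell}(S_T, \Gamma_T) \mid \Pi_T = \pi, \Gamma_T = (\tilde{\gamma}_T^1, \dots, \tilde{\gamma}_T^n)], \tag{28}$$

and for $1 \le t \le T - 1$,

$$V_t(\pi) = \inf_{\{\tilde{\gamma}_t^i \in F(\mathcal{Y}_t^i \times \mathcal{M}_t^i,\, \mathcal{U}_t^i),\ 1 \le i \le n\}} \mathbb{E}[\tilde{\ell}(S_t, \Gamma_t) + V_{t+1}(\eta_t(\Pi_t, \Gamma_t, \mathbf{Z}_t)) \mid \Pi_t = \pi, \Gamma_t = (\tilde{\gamma}_t^1, \dots, \tilde{\gamma}_t^n)]. \tag{29}$$

Then the arg inf at each time step gives the coordinator's optimal prescriptions for the controllers when the coordinator's information state is $\pi$. □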

Proposition 2 gives a dynamic program for the coordinator's problem (Problem 2). Since the coordinated system is a POMDP, computational algorithms for POMDPs can be used to solve the dynamic program for the coordinator's problem as well. We refer the reader to [44] and references therein for a review of algorithms to solve POMDPs.



Stage 4: Equivalence between the two models 

We first observe that since $C_s \subset C_t$ for all $s < t$, under any given coordination strategy $d$, we can use $C_t$ to evaluate the past prescriptions by recursive substitution. For example, for $t = 2, 3$, the past prescriptions can be evaluated as functions of $C_2$, $C_3$ as follows:

$$\Gamma_1 = d_1(C_1) =: \tilde{d}_1(C_2),$$

$$\Gamma_2 = d_2(C_2, \Gamma_1) = d_2(C_2, \tilde{d}_1(C_2)) =: \tilde{d}_2(C_3).$$
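The recursive substitution above can be sketched in code. This is an illustrative sketch only: it assumes that $C_t$ is stored as the ordered tuple $(\mathbf{Z}_1, \dots, \mathbf{Z}_{t-1})$ and that prescriptions are represented by hashable labels; the names `past_prescriptions`, `d1` and `d2` are hypothetical.

```python
def past_prescriptions(d, c_t):
    """Recover Gamma_1, ..., Gamma_{t-1} from C_t = (Z_1, ..., Z_{t-1}).

    d[k] is the coordination rule at time k + 1, called as d[k](C, past_gammas).
    """
    gammas = []
    for k in range(len(c_t)):
        c_k = tuple(c_t[:k])  # C_{k+1} = Z_{1:k} is a prefix of C_t
        gammas.append(d[k](c_k, tuple(gammas)))
    return gammas


# Hypothetical coordination rules; prescriptions are string labels here.
def d1(c, past):
    return "gamma_default"


def d2(c, past):
    return "gamma_A" if c == ("a",) and past == ("gamma_default",) else "gamma_B"


recovered = past_prescriptions([d1, d2], ("a", "b"))  # uses C_3 = (Z_1, Z_2)
```

Because every earlier $C_k$ is a prefix of $C_t$, anyone who knows $C_t$ and the strategy $d$ can recompute all past prescriptions without extra communication.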
We can now state the following result.

Proposition 3 The basic model of Section II-A and the coordinated system are equivalent. More precisely:

(a) Given any control strategy $g^{1:n}$ for the basic model, choose a coordination strategy $d$ for the coordinated system of Stage 1 as

$$d_t(C_t) = (g_t^1(\cdot, \cdot, C_t), \dots, g_t^n(\cdot, \cdot, C_t)).$$

Then $J(d) = J(g^{1:n})$.

(b) Conversely, for any coordination strategy $d$ for the coordinated system, choose a control strategy $g^{1:n}$ for the basic model as

$$g_1^i(\cdot, \cdot, C_1) = d_1^i(C_1),$$

and

$$g_t^i(\cdot, \cdot, C_t) = d_t^i(C_t, \Gamma_{1:t-1}),$$

where $\Gamma_k = d_k(C_k, \Gamma_{1:k-1})$, $k = 1, 2, \dots, t - 1$, and $d_t^i(\cdot)$ is the $i$-th component of $d_t(\cdot)$ (that is, $d_t^i(\cdot)$ gives the coordinator's prescription for the $i$-th controller). Then, $J(g^{1:n}) = J(d)$. □

Proof: See Appendix B. ■
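Part (a) of Proposition 3 amounts to partially applying each control law to the realized common information. A minimal sketch, assuming control laws are ordinary functions of `(y, m, c)`; the helper `coordination_rule` and the two toy control laws are hypothetical.

```python
def coordination_rule(control_laws):
    """Build d_t from control laws g_t^i(y, m, c), as in part (a)."""
    def d_t(c):
        # Fix the common information c; each prescription maps (y, m) to an action.
        return tuple((lambda y, m, g=g: g(y, m, c)) for g in control_laws)
    return d_t


# Hypothetical control laws for two controllers with integer-valued data.
g1 = lambda y, m, c: y + c
g2 = lambda y, m, c: m * c

d_t = coordination_rule([g1, g2])
gamma1, gamma2 = d_t(3)  # prescriptions for common information c = 3
```

Each prescription returns exactly the action the original control law would have chosen, which is why the two systems incur the same cost.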

Stage 5: Structural result and dynamic program for the basic model 

Combining Proposition 1 with Proposition 3, we get the following structural result for Problem 1.

Theorem 2 (Structural Result for Optimal Control Strategies) In Problem 1, there exist optimal control strategies of the form

$$U_t^i = g_t^i(Y_t^i, M_t^i, \Pi_t), \quad i = 1, 2, \dots, n, \tag{30}$$

where $\Pi_t$ is the conditional distribution on $X_t, \mathbf{Y}_t, \mathbf{M}_t$ given $C_t$, defined as

$$\Pi_t(x, \mathbf{y}, \mathbf{m}) := \mathbb{P}^{g^{1:n}_{1:t-1}}(X_t = x, \mathbf{Y}_t = \mathbf{y}, \mathbf{M}_t = \mathbf{m} \mid C_t), \tag{31}$$

for all possible realizations $(x, \mathbf{y}, \mathbf{m})$ of $(X_t, \mathbf{Y}_t, \mathbf{M}_t)$. □

We call $\Pi_t$ the common information state. Recall that $\Pi_t$ takes values in the set $\mathcal{B}_t$ defined in (27).
Consider a control strategy $g^i$ for controller $i$ of the form specified in Theorem 2. The control law $g_t^i$ at time $t$ is a function from the space $\mathcal{Y}_t^i \times \mathcal{M}_t^i \times \mathcal{B}_t$ to the space of decisions $\mathcal{U}_t^i$. Equivalently, the control law $g_t^i$ can be represented as a collection of functions $\{g_t^i(\cdot, \cdot, \pi)\}_{\pi \in \mathcal{B}_t}$, where each element of this collection is a function from $\mathcal{Y}_t^i \times \mathcal{M}_t^i$ to $\mathcal{U}_t^i$. An element $g_t^i(\cdot, \cdot, \pi)$ of this collection specifies a control action for each possible realization of $Y_t^i$, $M_t^i$ and a fixed realization $\pi$ of $\Pi_t$. We call $g_t^i(\cdot, \cdot, \pi)$ the partial control law of controller $i$ at time $t$ for the given realization $\pi$ of the common information state $\Pi_t$.

We now use Proposition 2 to describe a dynamic programming decomposition of the problem of finding optimal control strategies. This dynamic programming decomposition allows us to evaluate optimal partial control laws for each realization $\pi$ of the common information state in a backward inductive manner. Recall that $\mathcal{B}_t$ is the space of all possible realizations of $\Pi_t$ (see (27)) and $F(\mathcal{Y}_t^i \times \mathcal{M}_t^i, \mathcal{U}_t^i)$ is the set of all functions from $\mathcal{Y}_t^i \times \mathcal{M}_t^i$ to $\mathcal{U}_t^i$ (see Section I-C).



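Theorem 3 (Dynamic Programming Decomposition) Define the functions $V_t : \mathcal{B}_t \to \mathbb{R}$, for $t = 1, \dots, T$ as follows:

$$V_T(\pi) = \inf_{\{\tilde{\gamma}_T^i \in F(\mathcal{Y}_T^i \times \mathcal{M}_T^i,\, \mathcal{U}_T^i),\ 1 \le i \le n\}} \mathbb{E}\{\ell(X_T, \tilde{\gamma}_T^1(Y_T^1, M_T^1), \dots, \tilde{\gamma}_T^n(Y_T^n, M_T^n)) \mid \Pi_T = \pi\}, \tag{32}$$

and for $1 \le t \le T - 1$,

$$V_t(\pi) = \inf_{\{\tilde{\gamma}_t^i \in F(\mathcal{Y}_t^i \times \mathcal{M}_t^i,\, \mathcal{U}_t^i),\ 1 \le i \le n\}} \mathbb{E}\{\ell(X_t, \tilde{\gamma}_t^1(Y_t^1, M_t^1), \dots, \tilde{\gamma}_t^n(Y_t^n, M_t^n)) + V_{t+1}(\eta_t(\pi, \tilde{\gamma}_t^1, \dots, \tilde{\gamma}_t^n, \mathbf{Z}_t)) \mid \Pi_t = \pi\}, \tag{33}$$

where $\eta_t$ is a $\mathcal{B}_{t+1}$-valued function defined in (26) and Appendix A.

For $t = 1, \dots, T$ and for each $\pi \in \mathcal{B}_t$, an optimal partial control law for controller $i$ is the minimizing choice of $\tilde{\gamma}_t^i$ in the definition of $V_t(\pi)$. Let $\Psi_t(\pi)$ denote the arg inf of the right hand side of $V_t(\pi)$, and $\Psi_t^i(\pi)$ denote its $i$-th component. Then, an optimal control strategy is given by:

$$g_t^i(\cdot, \cdot, \pi) = \Psi_t^i(\pi). \tag{34}$$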
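For finite spaces, the terminal step (32) can be evaluated by brute-force enumeration of the partial control laws. The sketch below is only an illustration under assumptions: each controller's local information $(y^i, m^i)$ is collapsed into a single label, and the names `all_prescriptions` and `terminal_value` as well as the toy cost are hypothetical. Enumeration is exponential in the size of the local information space, so this is not a practical algorithm.

```python
from itertools import product


def all_prescriptions(local_infos, actions):
    """Yield every map from local information to an action, as a dict."""
    for choice in product(actions, repeat=len(local_infos)):
        yield dict(zip(local_infos, choice))


def terminal_value(pi, local_infos, actions, cost):
    """Evaluate (32) by enumeration.

    pi maps (x, (l_1, ..., l_n)) to a probability, where l_i stands for
    controller i's local information (y_i, m_i).
    """
    spaces = [list(all_prescriptions(local_infos[i], actions[i]))
              for i in range(len(actions))]
    best_value, best_gammas = float("inf"), None
    for gammas in product(*spaces):
        # Expected cost when controller i applies prescription gammas[i].
        value = sum(p * cost(x, tuple(g[l] for g, l in zip(gammas, locs)))
                    for (x, locs), p in pi.items())
        if value < best_value:
            best_value, best_gammas = value, gammas
    return best_value, best_gammas


# Hypothetical example: X is uniform on {0, 1}; controller 1 sees X, while
# controller 2 sees nothing informative (its local information is always 0).
pi = {(0, (0, 0)): 0.5, (1, (1, 0)): 0.5}
cost = lambda x, u: (u[0] - x) ** 2 + (u[1] - x) ** 2
v_T, gammas = terminal_value(pi, [[0, 1], [0]], [[0, 1], [0, 1]], cost)
```

In this toy instance the minimizing prescription for controller 1 copies its observation, while controller 2 can do no better than a constant action.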



A. Comparison with Person by Person and Designer Approaches 

The common information based approach adopted above differs from the person-by-person approach and the designer's approach mentioned in the introduction. In particular, the structural result of Theorem 2 cannot be found by the person-by-person approach. If we fix the strategies of all but the $i$-th controller to an arbitrary choice, then it is not necessarily optimal for controller $i$ to use a strategy of the form in Theorem 2. This is because if controller $j$'s strategy uses the entire common information $C_t$, then controller $i$, in general, would need to consider the entire common information to better predict controller $j$'s actions, and hence controller $i$'s optimal choice of action may also depend on the entire common information. The use of the common information based approach allowed us to prove that all controllers can jointly use strategies of the form in Theorem 2 without loss of optimality.

The dynamic programming decomposition of Theorem 3 is simpler than any dynamic programming decomposition obtained using the designer's approach. As described earlier, the designer's approach models the decentralized control problem as an open-loop centralized planning problem in which a designer at each stage chooses control laws $g_t^i$ that map $(Y_t^i, M_t^i, C_t)$ to $U_t^i$, $i = 1, \dots, n$. On the other hand, the common-information approach developed in this paper models the decentralized control problem as a closed-loop centralized planning problem in which a coordinator at each stage chooses the partial control laws $\tilde{\gamma}_t^i$ that map $(Y_t^i, M_t^i)$ to $U_t^i$, $i = 1, \dots, n$. The space of partial control laws is never larger than the space of full control laws; if the common information is non-empty, it is strictly smaller. Thus, the dynamic programming decomposition of Theorem 3 is simpler than that obtained by the designer's approach.
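The size gap between the two decision spaces is easy to quantify for finite sets, since the number of functions from a set of size $a$ to a set of size $b$ is $b^a$. The sizes used below are hypothetical and chosen only for illustration.

```python
def num_maps(domain_size: int, codomain_size: int) -> int:
    """Number of functions from a finite domain to a finite codomain."""
    return codomain_size ** domain_size


# Hypothetical sizes: |Y| = |M| = |U| = 2 for one controller, and the
# common information C_t has 4 possible realizations.
n_y, n_m, n_u, n_c = 2, 2, 2, 4
partial_laws = num_maps(n_y * n_m, n_u)      # maps (y, m) -> u
full_laws = num_maps(n_y * n_m * n_c, n_u)   # maps (y, m, c) -> u
```

With these sizes the coordinator chooses among 16 partial control laws per controller, while the designer chooses among 65536 full control laws; the latter count grows exponentially with the number of realizations of the common information.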



This simplification is best illustrated by the example of Section IV-C1, where all controllers receive a common observation $Y_t^{com}$. For this example, we show that our information state (and hence our dynamic program) reduces to $\mathbb{P}(X_t \mid Y_{1:t}^{com})$, which is identical to the information state of centralized stochastic control. In contrast, the information state $\mathbb{P}(X_t, Y_{1:t}^{com})$ obtained by the designer's approach is much more complicated.

B. Special Cases: The Results 



In Section II-B we described several models of decentralized control problems that are special cases of the model described in Section II-A. In this section, we state the results of Theorems 2 and 3 for these models.

1) Delayed Sharing Information Structure:

Corollary 1 In the delayed sharing information structure of Section II-B1, there exist optimal control strategies of the form

$$U_t^i = g_t^i(Y_{t-s+1:t}^i, U_{t-s+1:t-1}^i, \Pi_t), \quad i = 1, 2, \dots, n, \tag{35}$$

where

$$\Pi_t := \mathbb{P}^{g^{1:n}_{1:t-1}}(X_t, \mathbf{Y}_{t-s+1:t}, \mathbf{U}_{t-s+1:t-1} \mid C_t). \tag{36}$$

Moreover, optimal control strategies can be obtained by a dynamic program similar to that of Theorem 3. □

The above result is analogous to the result in [36].



2) Delayed State Sharing Information Structure:

Corollary 2 In the delayed state sharing information structure of Section II-B2, there exist optimal control strategies of the form

$$U_t^i = g_t^i(X_{t-s+1:t}^i, U_{t-s+1:t-1}^i, \Pi_t), \quad i = 1, 2, \dots, n, \tag{37}$$

where

$$\Pi_t := \mathbb{P}^{g^{1:n}_{1:t-1}}(\mathbf{X}_{t-s+1:t}, \mathbf{U}_{t-s+1:t-1} \mid C_t). \tag{38}$$

Moreover, optimal control strategies can be obtained by a dynamic program similar to that of Theorem 3. □

The above result is analogous to the result in [36].
3) Periodic Sharing Information Structure:

Corollary 3 In the periodic sharing information structure of Section II-B3, there exist optimal control strategies of the form

$$U_t^i = g_t^i(Y_{ks+1:t}^i, U_{ks+1:t-1}^i, \Pi_t), \quad i = 1, 2, \dots, n, \quad ks < t \le (k+1)s, \tag{39}$$

where

$$\Pi_t := \mathbb{P}^{g^{1:n}_{1:t-1}}(X_t, \mathbf{Y}_{ks+1:t}, \mathbf{U}_{ks+1:t-1} \mid C_t), \quad ks < t \le (k+1)s. \tag{40}$$

Moreover, optimal control strategies can be obtained by a dynamic program similar to that of Theorem 3. □
The above result gives a finer dynamic programming decomposition than [37]. In [37], the dynamic programming decomposition is only carried out at the times of information sharing, $t = ks$, $k = 1, 2, \dots$; and at each step the partial control laws until the next sharing instant are chosen. In contrast, in the above dynamic program, the partial control laws of each step are chosen sequentially.
4) Control Sharing Information Structure:

Corollary 4 In the control sharing information structure of Section II-B4, there exist optimal control strategies of the form

$$U_t^i = g_t^i(Y_{1:t}^i, \Pi_t), \quad i = 1, 2, \dots, n, \tag{41}$$

where

$$\Pi_t := \mathbb{P}^{g^{1:n}_{1:t-1}}(X_t, \mathbf{Y}_{1:t} \mid C_t). \tag{42}$$

Moreover, optimal control strategies can be obtained by a dynamic program similar to that of Theorem 3. □

5) No Shared Memory with or without finite local memory:

Corollary 5 In the information structure of Section II-B5, there exist optimal control strategies of the form

$$U_t^i = g_t^i(Y_t^i, M_t^i, \Pi_t), \tag{43}$$

where

$$\Pi_t = \mathbb{P}^{g^{1:n}_{1:t-1}}(X_t, \mathbf{Y}_t, \mathbf{M}_t). \tag{44}$$

Moreover, optimal control strategies can be obtained by a dynamic program similar to that of Theorem 3. □

Note that, since the common information is empty, the common information state $\Pi_t$ is now an unconditional probability. In particular, $\Pi_t$ is a constant random variable and takes a fixed value that depends only on the choice of past control laws. Therefore, we can define an appropriate control law $\tilde{g}_t^i$ such that $g_t^i(Y_t^i, M_t^i, \Pi_t) = \tilde{g}_t^i(Y_t^i, M_t^i)$ with probability 1. Hence, the structural result of (43) may be simplified to

$$U_t^i = g_t^i(Y_t^i, M_t^i, \Pi_t) = \tilde{g}_t^i(Y_t^i, M_t^i).$$

This result is redundant since all control laws are of the above form. Nonetheless, Corollary 5 gives a procedure for finding such control laws using the dynamic program of Theorem 3.

The above result is similar to the results in [42] for the case of one controller with finite memory and to those in [23] for the case of two controllers with finite memories.

IV. Simplifications and Generalizations 

A. Simplification of the Common Information State 

Theorems 2 and 3 identify the conditional probability distribution on $(X_t, \mathbf{Y}_t, \mathbf{M}_t)$ given $C_t$ as the common information state for our problem. In the following lemma, we make the simple observation that in our model the conditional distribution on $(X_t, \mathbf{Y}_t, \mathbf{M}_t)$ given $C_t$ is completely determined by the conditional distribution on $(X_t, \mathbf{M}_t)$ given $C_t$.

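Lemma 2 For any choice of control laws $g^{1:n}_{1:t-1}$, define the conditional distribution on $X_t, \mathbf{M}_t$ given $C_t$ as

$$\Pi_t^{new}(x, \mathbf{m}) := \mathbb{P}^{g^{1:n}_{1:t-1}}(X_t = x, \mathbf{M}_t = \mathbf{m} \mid C_t),$$

for all possible realizations $(x, \mathbf{m})$ of $(X_t, \mathbf{M}_t)$. Also define $\mathcal{B}_t^{new} := \Delta(\mathcal{X}_t \times \mathcal{M}_t^1 \times \dots \times \mathcal{M}_t^n)$. Then,

$$\Pi_t^{new}(x, \mathbf{m}) = \sum_{\mathbf{y}} \Pi_t(x, \mathbf{y}, \mathbf{m}). \tag{45}$$

Therefore, $\Pi_t^{new} = \chi_t(\Pi_t)$, where each component of the $\mathcal{B}_t^{new}$-valued function $\chi_t$ is determined by the right hand side of (45). Also,

$$\Pi_t(x, \mathbf{y}, \mathbf{m}) = \Pi_t^{new}(x, \mathbf{m})\, \mathbb{P}(\mathbf{Y}_t = \mathbf{y} \mid X_t = x), \tag{46}$$

where the second term on the right hand side of (46) is determined by the fixed distribution of the observation noises. Therefore, $\Pi_t = \zeta_t(\Pi_t^{new})$, where each component of the $\mathcal{B}_t$-valued function $\zeta_t$ is determined by the right hand side of (46). □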
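The two maps in Lemma 2 can be sketched for finite spaces as follows. The function names `chi` and `zeta` mirror the symbols in (45) and (46); the dictionary representation and the numerical observation model `obs_prob` are assumptions made for this illustration.

```python
def chi(pi):
    """Eq. (45): marginalize the observation y out of pi(x, y, m)."""
    pi_new = {}
    for (x, y, m), p in pi.items():
        pi_new[(x, m)] = pi_new.get((x, m), 0.0) + p
    return pi_new


def zeta(pi_new, obs_prob):
    """Eq. (46): rebuild pi(x, y, m) from pi_new(x, m).

    obs_prob[x][y] is the fixed observation model P(Y_t = y | X_t = x).
    """
    return {(x, y, m): p * q
            for (x, m), p in pi_new.items()
            for y, q in obs_prob[x].items()}


# Hypothetical numbers for a binary state, binary observation, one memory value.
obs_prob = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
pi_new = {(0, "m0"): 0.25, (1, "m0"): 0.75}
pi = zeta(pi_new, obs_prob)
round_trip = chi(pi)
```

Applying `chi` after `zeta` returns the original `pi_new`, which reflects that the two information states carry the same information.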



Lemma 2 implies that the results of Theorems 2 and 3 can be written in terms of $\Pi_t^{new}$.


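Theorem 4 (Alternative Common Information State) In Problem 1, there exist optimal control strategies of the form

$$U_t^i = g_t^i(Y_t^i, M_t^i, \Pi_t^{new}), \quad i = 1, 2, \dots, n, \tag{47}$$

where

$$\Pi_t^{new} := \mathbb{P}^{g^{1:n}_{1:t-1}}(X_t, \mathbf{M}_t \mid C_t). \tag{48}$$

Further, define the functions $V_t^{new} : \mathcal{B}_t^{new} \to \mathbb{R}$, for $t = 1, \dots, T$ as follows:

$$V_T^{new}(\pi^{new}) = \inf_{\{\tilde{\gamma}_T^i \in F(\mathcal{Y}_T^i \times \mathcal{M}_T^i,\, \mathcal{U}_T^i),\ 1 \le i \le n\}} \mathbb{E}\{\ell(X_T, \tilde{\gamma}_T^1(Y_T^1, M_T^1), \dots, \tilde{\gamma}_T^n(Y_T^n, M_T^n)) \mid \Pi_T = \zeta_T(\pi^{new})\}, \tag{49}$$

and for $1 \le t \le T - 1$,

$$V_t^{new}(\pi^{new}) = \inf_{\{\tilde{\gamma}_t^i \in F(\mathcal{Y}_t^i \times \mathcal{M}_t^i,\, \mathcal{U}_t^i),\ 1 \le i \le n\}} \mathbb{E}\{\ell(X_t, \tilde{\gamma}_t^1(Y_t^1, M_t^1), \dots, \tilde{\gamma}_t^n(Y_t^n, M_t^n)) + V_{t+1}^{new}(\chi_{t+1}(\eta_t(\Pi_t, \tilde{\gamma}_t^1, \dots, \tilde{\gamma}_t^n, \mathbf{Z}_t))) \mid \Pi_t = \zeta_t(\pi^{new})\}, \tag{50}$$

where $\zeta_t$, $\chi_t$ are defined in Lemma 2 and $\eta_t$ is defined in (26) and Appendix A.

For $1 \le t \le T$ and for each $\pi^{new}$, an optimal partial control law for controller $i$ is the minimizing choice of $\tilde{\gamma}_t^i$ in the definition of $V_t^{new}(\pi^{new})$.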



Proof: For any $\pi^{new} \in \mathcal{B}_t^{new}$ and any $\pi \in \mathcal{B}_t$, it is straightforward to establish using a backward induction argument that $V_t^{new}(\pi^{new}) = V_t(\zeta_t(\pi^{new}))$ and $V_t(\pi) = V_t^{new}(\chi_t(\pi))$, where $V_t(\cdot)$ is the value function from the dynamic program in Theorem 3. The optimality of the new dynamic program then follows from the optimality of the dynamic program in Theorem 3. ■

The result of Theorem 4 is conceptually the same as the results in Theorems 2 and 3. Theorem 4 implies that the corollaries of Section III-B can be restated in terms of new information states by simply removing $\mathbf{Y}_t$ from the definition of the original information states. For example, the result of Corollary 1 for the delayed sharing information structure is also true when $\Pi_t$ is replaced by

$$\Pi_t^{new} := \mathbb{P}^{g^{1:n}_{1:t-1}}(X_t, \mathbf{Y}_{t-s+1:t-1}, \mathbf{U}_{t-s+1:t-1} \mid C_t). \tag{51}$$

This result is simpler than that of [36, Theorem 2].

B. Generalization of the Model 



The methodology described in Section III relies on the fact that the shared memory is common information among all controllers. Since the coordinator in the coordinated system knows only the common information, any coordination strategy can be mapped to an equivalent control strategy in the basic model (see Stage 4 of Section III). In some cases, in addition to the shared memory, the current observation (or if the current observation is a vector, some components of it) may also be commonly available to all controllers. The general methodology of Section III can be easily modified to include such cases as well.



Consider the model of Section II-A with the following modifications:

1) In addition to their current local observation, all controllers have a common observation at time $t$,

$$Y_t^{com} = h_t^{com}(X_t, V_t), \tag{52}$$

where $\{V_t, t = 1, \dots, T\}$ is a sequence of i.i.d. random variables with probability distribution $Q_V$ which is independent of all other primitive random variables.

2) The shared memory $C_t$ at time $t$ is a subset of $\{Y_{1:t-1}^{com}, \mathbf{Y}_{1:t-1}, \mathbf{U}_{1:t-1}\}$.

3) Each controller selects its action using a control law of the form

$$U_t^i = g_t^i(Y_t^i, M_t^i, C_t, Y_t^{com}). \tag{53}$$

4) After taking the control action at time $t$, controller $i$ sends a subset $Z_t^i$ of $\{M_t^i, Y_t^i, U_t^i, Y_t^{com}\}$ that necessarily includes $Y_t^{com}$. That is,

$$Y_t^{com} \subset Z_t^i \subset \{M_t^i, Y_t^i, U_t^i, Y_t^{com}\}.$$

This implies that the history of common observations is necessarily a part of the shared memory, that is, $Y_{1:t-1}^{com} \subset C_t$.



The rest of the model is the same as in Section II-A. In particular, the local memory update satisfies (7), so the local memory and shared memory at time $t + 1$ do not overlap. The instantaneous cost is given by $\ell(X_t, \mathbf{U}_t)$ and the objective is to minimize an expected total cost given by (8).

The arguments of Section III are also valid for this model. The observation process in Lemma 1 is now defined as $O_{t+1} = \{\mathbf{Z}_t, Y_{t+1}^{com}\}$. The analysis of Section III leads to structural results and dynamic programming decompositions analogous to Theorems 2 and 3 with $\Pi_t$ now defined as

$$\Pi_t := \mathbb{P}^{g^{1:n}_{1:t-1}}(X_t, \mathbf{Y}_t, \mathbf{M}_t \mid C_t, Y_t^{com}). \tag{54}$$

Using an argument similar to Lemma 2, we can show that the result of Theorem 4 is true for the above model with $\Pi_t^{new}$ defined as

$$\Pi_t^{new} := \mathbb{P}^{g^{1:n}_{1:t-1}}(X_t, \mathbf{M}_t \mid C_t, Y_t^{com}). \tag{55}$$
C. Examples of the Generalized Model

1) Controllers with Identical Information: Consider the following special case of the above generalized model.

1) All controllers only make the common observation $Y_t^{com}$; controllers have no local observation or local memory.

2) The shared memory at time $t$ is $C_t = Y_{1:t-1}^{com}$. Thus, at time $t$, all controllers have identical information given as $\{C_t, Y_t^{com}\} = Y_{1:t}^{com}$.

3) After taking the action at time $t$, each controller sends $Z_t^i = Y_t^{com}$ to the shared memory.

Recall that the coordinator's prescriptions in Section III are chosen from the set of functions from $\mathcal{Y}_t^i \times \mathcal{M}_t^i$ to $\mathcal{U}_t^i$. Since, in this case, $\mathcal{Y}_t^i = \mathcal{M}_t^i = \emptyset$, we interpret the coordinator's prescriptions as prescribed actions. That is, $\Gamma_t^i = U_t^i$. With this interpretation, the common information state becomes

$$\Pi_t := \mathbb{P}^{g^{1:n}_{1:t-1}}(X_t \mid Y_{1:t}^{com}) \tag{56}$$

and the dynamic program of Theorem 3 becomes

$$V_T(\pi) = \inf_{\{u_T^i \in \mathcal{U}_T^i,\ 1 \le i \le n\}} \mathbb{E}\{\ell(X_T, u_T^1, \dots, u_T^n) \mid \Pi_T = \pi\}, \tag{57}$$

and for $1 \le t \le T - 1$,

$$V_t(\pi) = \inf_{\{u_t^i \in \mathcal{U}_t^i,\ 1 \le i \le n\}} \mathbb{E}\{\ell(X_t, u_t^1, \dots, u_t^n) + V_{t+1}(\eta_t(\pi, u_t^1, \dots, u_t^n, Y_{t+1}^{com})) \mid \Pi_t = \pi\}. \tag{58}$$

Since all the controllers have identical information, the above results correspond to the centralized dynamic program of Theorem 1 with a single controller choosing all the actions.

2) Coupled subsystems with control sharing information structure: Consider the following special case of the above generalized model.

1) The state of the system at time $t$ is an $(n+1)$-dimensional vector $X_t = (X_t^1, X_t^2, \dots, X_t^n, X_t^*)$, where $X_t^i$, $i = 1, \dots, n$, corresponds to the local state of subsystem $i$, and $X_t^*$ is a global state of the system.

2) The state update function is such that the global state evolves according to

$$X_{t+1}^* = f_t^*(X_t^*, \mathbf{U}_t, N_t^0),$$

while the local state of subsystem $i$ evolves according to

$$X_{t+1}^i = f_t^i(X_t^i, X_t^*, \mathbf{U}_t, N_t^i),$$

where $\{N_t^0, t = 1, \dots, T\}, \dots, \{N_t^n, t = 1, \dots, T\}$ are mutually independent i.i.d. noise processes that are independent of the initial state $X_1 = (X_1^1, X_1^2, \dots, X_1^n, X_1^*)$.

3) At time $t$, the common observation of all controllers is given by $Y_t^{com} = X_t^*$.

4) At time $t$, the local observation of controller $i$ is given by $Y_t^i = X_t^i$, $i = 1, \dots, n$.

5) The shared memory at time $t$ is $C_t = \{X_{1:t-1}^*, \mathbf{U}_{1:t-1}\}$. At each time $t$, after taking the action $U_t^i$, controller $i$ sends $Z_t^i = \{X_t^*, U_t^i\}$ to the shared memory.

The above special case corresponds to the model of coupled subsystems with control sharing considered in [39], where several applications of this model are also presented. It is shown in [39] that there is no loss of optimality in restricting attention to controllers with no local memory, i.e., $M_t^i = \emptyset$. With this additional restriction, the results of Theorems 2 and 3 apply for this model with $\Pi_t$ defined as

$$\Pi_t := \mathbb{P}^{g^{1:n}_{1:t-1}}(X_t^*, X_t^1, \dots, X_t^n \mid X_{1:t}^*, \mathbf{U}_{1:t-1}).$$

Note that $\Pi_t$ can be evaluated from $X_t^*$ and $\mathbb{P}^{g^{1:n}_{1:t-1}}(X_t^1, \dots, X_t^n \mid X_{1:t}^*, \mathbf{U}_{1:t-1})$. It is shown in [39] that $X_t^1, X_t^2, \dots, X_t^n$ are conditionally independent given $X_{1:t}^*, \mathbf{U}_{1:t-1}$, hence the joint distribution $\mathbb{P}^{g^{1:n}_{1:t-1}}(X_t^1, \dots, X_t^n \mid X_{1:t}^*, \mathbf{U}_{1:t-1})$ is a product of its marginal distributions.

3) Broadcast information structure: Consider the following special case of the above generalized model.

1) The state of the system at time $t$ is an $n$-dimensional vector $X_t = (X_t^1, X_t^2, \dots, X_t^n)$, where $X_t^i$, $i = 1, \dots, n$, corresponds to the local state of subsystem $i$. The first component $i = 1$ is special and called the central node. Other components, $i = 2, \dots, n$, are called peripheral nodes.

2) The state update function is such that the state of the central node evolves according to

$$X_{t+1}^1 = f_t^1(X_t^1, U_t^1, N_t^1),$$

while the state of the peripheral nodes evolves according to

$$X_{t+1}^i = f_t^i(X_t^i, X_t^1, U_t^i, U_t^1, N_t^i),$$

where $\{N_t^i, i = 1, 2, \dots, n;\ t = 1, \dots\}$ are noise processes that are independent across time and independent of each other.

3) At time $t$, the common observation of all controllers is given by $Y_t^{com} = X_t^1$.

4) At time $t$, the local observation of controller $i$, $i \ge 2$, is given by $Y_t^i = X_t^i$. Controller 1 does not have any local observations.

5) No controller sends any additional data to the shared memory. Thus, the shared memory consists of just the history of common observations, i.e., $C_t = Y_{1:t-1}^{com} = X_{1:t-1}^1$.

The above special case corresponds to the model of decentralized systems with broadcast structure considered in [19]. It is shown in [19] that there is no loss of optimality in restricting attention to controllers with no local memory, i.e., $M_t^i = \emptyset$. With this additional restriction, the results of Theorems 2 and 3 apply for this model with $\Pi_t$ defined as

$$\Pi_t := \mathbb{P}^{g^{1:n}_{1:t-1}}(X_t^1, \dots, X_t^n \mid X_{1:t}^1).$$

Note that $\Pi_t$ can be evaluated from $X_t^1$ and $\mathbb{P}^{g^{1:n}_{1:t-1}}(X_t^2, \dots, X_t^n \mid X_{1:t}^1)$. It is shown in [19] that $X_t^2, \dots, X_t^n$ are conditionally independent given $X_{1:t}^1$, hence the joint distribution $\mathbb{P}^{g^{1:n}_{1:t-1}}(X_t^2, \dots, X_t^n \mid X_{1:t}^1)$ is a product of its marginal distributions.
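The practical benefit of the product form noted in the last two examples is that the information state can be stored as one marginal per subsystem rather than as a full joint table. A minimal sketch, with a hypothetical helper `joint_from_marginals` and made-up numbers:

```python
from itertools import product


def joint_from_marginals(marginals):
    """Joint distribution of independent components, one dict per component."""
    joint = {}
    for combo in product(*(m.items() for m in marginals)):
        values = tuple(v for v, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        joint[values] = prob
    return joint


# Hypothetical conditional marginals for two binary local states.
marginals = [{0: 0.5, 1: 0.5}, {0: 0.1, 1: 0.9}]
joint = joint_from_marginals(marginals)
```

For $n$ components with $k$ values each, the marginals need $nk$ numbers while the joint table needs $k^n$, so the factored form is the one worth carrying through a dynamic program.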

V. Extension to infinite horizon 



In this section, we consider the basic model of Section II-A with an infinite time horizon. Assume that

(i) The state of the system, the observations and the control actions take values in time-invariant sets $\mathcal{X}$, $\mathcal{Y}^i$, $\mathcal{U}^i$, respectively.

(ii) The local memories $M_t^i$ and the updates to the shared memory $Z_t^i$ take values in time-invariant sets $\mathcal{M}^i$ and $\mathcal{Z}^i$, respectively.

(iii) The dynamics of the system (equation (1)) and the observation model (equation (2)) are time-homogeneous. That is, the functions $f_t$ and $h_t^i$ in equations (1) and (2) do not vary with time.

Let the cost of using a strategy $g^{1:n}$ be defined as

$$J(g^{1:n}) := \mathbb{E}^{g^{1:n}}\Big[\sum_{t=1}^{\infty} \beta^{t-1} \ell(X_t, \mathbf{U}_t)\Big], \tag{59}$$

where $\beta \in [0, 1)$ is a discount factor. We can follow the arguments of Section III to formulate the problem of the coordinated system with an infinite time horizon. As in Section III, the coordinated system is equivalent to a POMDP. The time-homogeneous nature of the coordinated system and its equivalence to a POMDP allow us to use known POMDP results (see [43]) to conclude the following theorem for the infinite time horizon problem.

Theorem 5 Consider Problem 1 with infinite time horizon and the objective of minimizing the expected cost given by equation (59). Then, there exists an optimal time-invariant control strategy of the form

$$U_t^i = g^i(Y_t^i, M_t^i, \Pi_t), \quad i = 1, 2, \dots, n. \tag{60}$$

Furthermore, consider the fixed point equation

$$V(\pi) = \inf_{\{\tilde{\gamma}^i \in F(\mathcal{Y}^i \times \mathcal{M}^i,\, \mathcal{U}^i),\ 1 \le i \le n\}} \mathbb{E}\{\ell(X_t, \tilde{\gamma}^1(Y_t^1, M_t^1), \dots, \tilde{\gamma}^n(Y_t^n, M_t^n)) + \beta V(\eta_t(\pi, \tilde{\gamma}^1, \dots, \tilde{\gamma}^n, \mathbf{Z}_t)) \mid \Pi_t = \pi\}. \tag{61}$$

Then, for any realization $\pi$ of $\Pi_t$, the optimal partial control laws are the choices of $\tilde{\gamma}^i$ that achieve the infimum in the right hand side of (61). □
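A discounted fixed point equation of the form (61) is typically solved by successive approximation. The sketch below is not the paper's algorithm: it assumes, purely for illustration, that only finitely many information states are reachable and that the expected one-step cost and the transition law between information states are given as functions `cost` and `step`; all names and the two-state example are hypothetical.

```python
def value_iteration(states, prescriptions, cost, step, beta, tol=1e-10, max_iter=10_000):
    """Iterate the Bellman operator of a discounted problem until convergence.

    cost(pi, g) is the expected one-step cost; step(pi, g) returns a list of
    (probability, next_information_state) pairs.
    """
    values = {pi: 0.0 for pi in states}
    for _ in range(max_iter):
        updated = {
            pi: min(cost(pi, g) + beta * sum(p * values[nxt] for p, nxt in step(pi, g))
                    for g in prescriptions)
            for pi in states
        }
        if max(abs(updated[pi] - values[pi]) for pi in states) < tol:
            return updated
        values = updated
    return values


# Hypothetical example with two information states and two joint prescriptions.
states = ["A", "B"]
prescriptions = ["stay", "switch"]
cost = lambda pi, g: 1.0 if pi == "A" else 0.0
step = lambda pi, g: [(1.0, pi if g == "stay" else ("B" if pi == "A" else "A"))]
V = value_iteration(states, prescriptions, cost, step, beta=0.5)
```

Since $\beta < 1$ makes the Bellman operator a contraction, the iterates converge to the unique fixed point; in general the information state ranges over a continuum of beliefs, so the finite set used here is a simplifying assumption.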



All the special cases of our information structure considered in Sections II-B and IV-C can be extended to infinite horizon problems if the state, observation and action spaces are time-invariant and the system dynamics and observation equations are time-homogeneous. The only exception is the control sharing information structure of Section II-B4, where the local memory takes values in sets that are increasing with time.



The Case of No Shared Memory: As discussed in Section III-B, if the shared memory is always empty then the common information state defined in Theorem 2 is the unconditional probability $\Pi_t = \mathbb{P}^{g^{1:n}_{1:t-1}}(X_t, \mathbf{Y}_t, \mathbf{M}_t)$. In particular, $\Pi_t$ is a random variable that takes a fixed (constant) value which depends only on the choice of past control laws. Therefore, for any function $g_t^i$ of $Y_t^i, M_t^i, \Pi_t$, there exists a function $\tilde{g}_t^i$ of $Y_t^i, M_t^i$ such that $\tilde{g}_t^i(Y_t^i, M_t^i) = g_t^i(Y_t^i, M_t^i, \Pi_t)$ with probability 1. While Theorem 5 establishes optimality of a time-invariant $g^i$, such time-invariance may not hold for the corresponding $\tilde{g}_t^i$. Similar observations were reported in [25].



VI. Discussion and Conclusions 

In centralized stochastic control, the controller's belief on the current state of the system plays a fundamental role in predicting future costs. If the control strategy for the future is fixed as a function of future beliefs, then the current belief is a sufficient statistic for future costs under any choice of current action. Hence, the optimal action at the current time is only a function of the current belief on the state. In decentralized problems where different controllers have different information, using a controller's belief on the state of the system presents two main difficulties: (i) Since the costs depend on the system state as well as on other controllers' actions, any prediction of future costs must involve a belief on the system state as well as some means of predicting other controllers' actions. (ii) Since different controllers have different information, the beliefs formed by each controller and their predictions of future costs cannot be expected to be consistent.

The approach we adopted in this paper tries to address these difficulties by using the fact that 
sharing of data among controllers creates common knowledge among the controllers. Beliefs 
based on this common knowledge are necessarily consistent among all controllers and can serve 
as a consistent sufficient statistic. Moreover, while controllers cannot accurately predict each 
other's control actions, they can know, for the observed realization of common information, the 
exact mapping used by each controller to map its local information to control action. These 
considerations suggest that common information based beliefs and partial control laws should 
play an important role in a general theory of decentralized stochastic control problems. The use 
of a fictitious coordinator allows us to make these considerations mathematically precise. Indeed, 
the coordinator's beliefs are based on common information and the coordinator's decisions are
the partial control laws. The results of the paper then follow by observing that the coordinator's 
problem can be viewed as a POMDP by identifying a new state that includes both the state of 
the dynamic system as well as the local information of the controllers. 

The specific model of shared and local memory update that we assumed is crucial for connecting 
the coordinator's problem to POMDPs and centralized stochastic control. A key assumption in 
centralized stochastic control is perfect recall, that is, the information obtained at any time is 
remembered at all future times. This is essential for the update of the beliefs in POMDPs. Our 
assumption that the shared memory is increasing in time ensures that the perfect recall property 
is true for the coordinator's problem. If the shared memory did not have perfect recall (that is, if some past contents were lost over time), then the update of the common information state in (26) would not hold and the results of Theorems 2 and 3 would not be true.

Another key factor in our result is that $S_t := \{X_t, \mathbf{Y}_t, \mathbf{M}_t\}$ serves as a state for the coordinator's problem. If the system state, observations and local memories take values in a time-invariant space, we have a state for the coordinator's problem which takes values in a time-invariant space. Hence, the common information state is a belief on a time-invariant space. The local memory update in (7) ensures that $S_t$ is a state. If the local memory update depended on the shared memory as well, that is, if (7) were replaced by

$$M_{t+1}^i \subset \{C_t, M_t^i, Y_t^i, U_t^i\},$$

then $S_t$ would no longer suffice as a state for the coordinator. In particular, the state update equations in Lemma 1 would no longer hold. The only recourse then would be to include $C_t$ as a part of the state, which would necessarily mean that the state space keeps increasing with time. This is undesirable not only because large state spaces imply increased complexity, but also because the increasing size of state spaces makes extensions of finite horizon results to infinite horizon problems conceptually difficult.

The connection between the coordinator's problem and POMDPs can be used for computational 
purposes as well. The dynamic program of Theorem 3 is essentially a POMDP dynamic program. 
In particular, just as in POMDPs, the value functions are piecewise linear and concave in $\pi_t$. This 
characterization of value functions is utilized to find computationally efficient algorithms for 
POMDPs. Such algorithmic solutions to general POMDPs are well-studied and can be employed 
here. We refer the reader to [44] and references therein for a review of algorithms to solve 
POMDPs. 
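
As an illustration of the piecewise linear and concave structure mentioned above, the following minimal sketch evaluates a value function represented as the pointwise minimum of finitely many linear functions of the belief (cost minimization, hence a minimum rather than a maximum). The function name `pwl_value` and the vectors in `ALPHAS` are hypothetical and chosen only for illustration; they are not taken from the paper or from [44].

```python
import numpy as np

def pwl_value(belief, alpha_vectors):
    """Evaluate V(pi) = min_k <alpha_k, pi> and return (value, index of minimizer).

    belief:        probability vector over a finite state space
    alpha_vectors: array of shape (K, S); each row is one linear piece
    """
    belief = np.asarray(belief, dtype=float)
    alphas = np.asarray(alpha_vectors, dtype=float)
    costs = alphas @ belief          # expected cost of each linear piece
    k = int(np.argmin(costs))        # piece that is active at this belief
    return float(costs[k]), k

# Hypothetical linear pieces over a two-point state space (illustration only).
ALPHAS = np.array([[0.0, 4.0],
                   [3.0, 1.0],
                   [2.0, 2.0]])

if __name__ == "__main__":
    for p in ([1.0, 0.0], [0.5, 0.5], [0.0, 1.0]):
        print(p, pwl_value(p, ALPHAS))
```

A minimum of linear functions is concave, which is the property that POMDP algorithms exploit. In the coordinator's problem, the belief would be a distribution over realizations of the coordinator's state, and each linear piece would correspond to a choice of prescriptions.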

While our results apply to a broad class of models, it would be worthwhile to identify 
special cases where the specific model features can be exploited to simplify our structural result. 
Examples of such simplification appear in p9| |, p9| . A common theme in many centralized 
dynamic programming solutions is to identify a key property of the value functions and use 
it to characterize the optimal decisions. Since our results also provide a dynamic program, an 
important avenue for future work would be to identify cases where properties of value functions 
can be analyzed to deduce a solution or to reduce the computational burden of finding the 
solution. 

Our approach in this paper illustrates that common information provides a common conceptual 
framework for several decentralized stochastic control problems. In our model, we explicitly 
included a shared memory which naturally served the purpose of common information among 
the controllers. More generally, we can define common information for any sequential 
decision-making problem and then address the problem from the perspective of a coordinator who 
knows the common information. Such a common information based approach for general sequential 
decision-making problems is presented in [40]. 

Appendix A 
The Update Function $\eta_t$ of the Coordinator's Information State 

Consider a realization $c_{t+1}$ of the shared memory $C_{t+1}$ at time $t + 1$. Let $\gamma_{1:t}$ be the 
corresponding realization of the coordinator's prescriptions until time $t$. We assume the realization 
$(c_{t+1}, \pi_{1:t}, \gamma_{1:t})$ to be of non-zero probability. Then, the realization $\pi_{t+1}$ of $\Pi_{t+1}$ is given by 




(62) 



Use Lemma 1 to simplify the above expression as 



(63) 



Since $c_{t+1} = (c_t, z_t)$, write the last term of (63) as 




Use Lemma 1 and the sequential order in which the system variables are generated to write 
the numerator as 



't = St\Ct,-fi:t} 



(65) 




(66) 
where we dropped from conditioning in (65) since under the given coordinator's strategy, it is 
a function of the rest of the terms in the conditioning. Substitute (66), (64), and (63) into (62), 
to get 



where $\eta^s_t(\cdot)$ is given by (62), (63), (64), and (66). $\eta_t(\cdot)$ is the vector $(\eta^s_t(\cdot))_{s \in \mathcal{S}}$. 
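
The general shape of such an update can be sketched numerically. The sketch below is not the update function of this appendix: it is a generic Bayes filter written under two simplifying assumptions that are ours, namely that the state space is finite and that the common observation is a deterministic function of the current state and prescription. The names `eta`, `toy_trans` and `toy_common_obs` are hypothetical. The filter first conditions the current belief on the new common observation and then propagates it through the state dynamics.

```python
import numpy as np

def eta(pi, gamma, z, trans, common_obs):
    """One belief-update step: condition on z, then propagate the state.

    pi:         current belief over states 0..S-1
    gamma:      prescription, passed through unchanged to the two models
    z:          realized common observation
    trans:      trans(s, gamma) -> vector of P(next state = . | state = s, gamma)
    common_obs: common_obs(s, gamma) -> common observation produced in state s
    """
    pi = np.asarray(pi, dtype=float)
    n_states = len(pi)
    # Keep only the mass on states consistent with the observed z.
    weights = np.array([pi[s] if common_obs(s, gamma) == z else 0.0
                        for s in range(n_states)])
    total = weights.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    posterior = weights / total
    # Push the conditioned belief through the transition model.
    next_pi = np.zeros(n_states)
    for s in range(n_states):
        next_pi += posterior[s] * np.asarray(trans(s, gamma), dtype=float)
    return next_pi

# Toy model (assumption): gamma[s] is the action taken in state s,
# action 0 keeps the state, action 1 moves it to (s + 1) mod 3,
# and the action itself is the common observation.
def toy_trans(s, gamma):
    nxt = np.zeros(3)
    nxt[s if gamma[s] == 0 else (s + 1) % 3] = 1.0
    return nxt

def toy_common_obs(s, gamma):
    return gamma[s]

if __name__ == "__main__":
    print(eta([0.5, 0.25, 0.25], (0, 1, 1), 1, toy_trans, toy_common_obs))
```

The zero-probability check mirrors the assumption above that the realization being conditioned on has non-zero probability.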



Appendix B 
Proof of Proposition 3 

(a) For any given control strategy $g^{1:n}$ in the basic model, define a coordination strategy $d$ for 
the coordinated system as 

$$d_t(C_t) = \left(g^1_t(\cdot, \cdot, C_t), \ldots, g^n_t(\cdot, \cdot, C_t)\right). \tag{67}$$

Consider Problems 1 and 2. Use control strategy $g^{1:n}$ in Problem 1 and coordination strategy 
$d$ given by (67) in Problem 2. Fix a specific realization of the primitive random variables 
$\{X_1, W^j_t, t = 1, \ldots, T, j = 0, 1, \ldots, n\}$ in the two problems. Equation (2) implies that the 
realization of $\mathbf{Y}_1$ will be the same in the two problems. Then, the choice of $d$ according to (67) 
implies that the realization of the control actions $\mathbf{U}_1$ will be the same in the two problems. 
This implies that the realization of the next state $X_2$ and the memories $\mathbf{M}_2$, $C_2$ will be the 
same in the two problems. Proceeding in a similar manner, it is clear that the choice of $d$ 
according to (67) implies that the realization of the states $\{X_t; t = 1, \ldots, T\}$, the observations 
$\{\mathbf{Y}_t; t = 1, \ldots, T\}$, the control actions $\{\mathbf{U}_t; t = 1, \ldots, T\}$ and the memories $\{\mathbf{M}_t; t = 1, \ldots, T\}$ 
and $\{C_t; t = 1, \ldots, T\}$ are all identical in Problems 1 and 2. Thus, the total expected cost under 
$g^{1:n}$ in Problem 1 is the same as the total expected cost under the coordination strategy given by (67) 
in Problem 2. That is, $J(g^{1:n}) = J(d)$. 

(b) The second part of Proposition 3 follows from similar arguments as above. 
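
The construction in part (a) amounts to partially applying each control law to the common information. The following minimal sketch illustrates this; the argument order (local memory, local observation, common information) and the toy control laws are assumptions made for illustration and are not taken from the paper.

```python
def make_coordination_strategy(control_laws):
    """Turn control laws g^i(m, y, c) into a coordination rule d(c).

    d(c) returns one prescription per controller; each prescription maps
    that controller's local information (m, y) to a control action.
    """
    def d(c):
        # Bind g as a default argument so each prescription keeps its own law.
        return tuple((lambda m, y, g=g: g(m, y, c)) for g in control_laws)
    return d

# Hypothetical control laws for two controllers (illustration only).
def g1(m, y, c):
    return (m + y + c) % 2

def g2(m, y, c):
    return (y * c + m) % 3

if __name__ == "__main__":
    d = make_coordination_strategy([g1, g2])
    prescriptions = d(4)                 # coordinator sees common information c = 4
    local_info = [(1, 0), (2, 5)]        # (m, y) for each controller
    via_coordinator = [p(m, y) for p, (m, y) in zip(prescriptions, local_info)]
    direct = [g(m, y, 4) for g, (m, y) in zip([g1, g2], local_info)]
    print(via_coordinator, direct)
```

For every realization of the local and common information, the actions obtained through the prescriptions coincide with the actions generated directly by the control laws, which is the step used repeatedly in the argument above.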

Appendix C 

Equivalence between the model of this paper and the model of [1] 

We refer to the model of this paper as the PHS (partial history sharing) model and the model 
of [1] as the CO (common observation) model. First, we describe the CO model and then show 
that the two models are equivalent by showing that the PHS model is a special case of the CO model 
and vice versa. 

The CO Model 

The following model was presented in [1]; we use a slightly different notation so that the 
notation matches with that of our paper. 

Consider a system with $n$ controllers. Let $X_t$ denote the state of the system, $Z_t$ denote the 
common observation of all controllers, $Y^i_t$ denote the private observation of controller $i$, $M^i_t$ the 
contents of the memory of controller $i$, and $U^i_t$ the control action of controller $i$, $i = 1, \ldots, n$. 

The system dynamics and observation equations are given by 

$$X_{t+1} = f_t(X_t, U^{1:n}_t, W^0_t), \tag{68}$$
$$Y^i_t = h^i_t(X_t, U^{1:i-1}_t, W^i_t), \quad i = 1, \ldots, n, \tag{69}$$
$$Z_t = c_t(X_t, U^{1:n}_{t-1}, Q_t), \tag{70}$$

where $\{X_1, Q_t, W^i_t, i = 0, \ldots, n, t = 1, \ldots, T\}$ are independent random variables. 
At time $t$, controller $i$ generates a control action and updates its memory as follows: 

$$U^i_t = g^i_t(Z_{1:t}, Y^i_t, M^i_{t-1}), \tag{71}$$
$$M^i_t = r^i_t(Z_{1:t}, Y^i_t, M^i_{t-1}). \tag{72}$$

At each time an instantaneous cost $l_t(X_t, U^{1:n}_t)$ is incurred. The system objective is to choose 
a control strategy $g^{1:n}_{1:T}$ and a memory update strategy $r^{1:n}_{1:T}$ to minimize a total expected cost. 

The PHS model is a special case of CO model 

Consider the PHS model described in Sec II-A of the paper and define 

x, = (x„r/- M,i-,zi-), 

Ui = Ul i = l,...,n 

F; = (F/,M^), 2 = 1,. ..,n 

— 

M/ = 0, i = l,...,n. 
Define the cost function 

It is easy to verify that the model {Xt: Ul''^, Yt'^, M^'", Zf) defined above is a special case of 
CO model. 

The CO model is a special case of PHS model 

In the CO model, the local observation $Y^i_t$ of controller $i$ depends on the control actions 
$U^{1:i-1}_t$. This feature is not present in the PHS model. Nonetheless, we can show that the CO model 
is a special case of the PHS model by splitting time and assuming that in the PHS model only one 
controller acts at each time. 

Define the following system variables for $\tau = 1, \ldots, nT$. For ease of notation, when 
$tn < \tau \le (t + 1)n$, we will write $\tau$ as $tn + i$. Thus, the system variables are defined for 
$t = 1, \ldots, T$ and $i = 1, \ldots, n$: 

= {Xt, MlL\), Xtr^+i = {Xt, Ur\ Ml--'-\ Mt"i), i = 2, . . . , n, 
Ztn+i — Ztn+i — ^1 i = 2, . . . , n, 



0, otherwise; 



AUlMll ifi = i, . 
0, otherwise; 

m;„+^. = 0, i = 

Define the cost function as: 

if i = n, 
0, otherwise. 

It is easy to verify that the model {Xt, C/^'", ly'", M^'", Zt) defined above is a special case 
of PHS model. 
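
The time-splitting index used in this reduction can be sketched as follows. The helper names `split_index` and `merge_index` are hypothetical; the only assumption is that the sub-step index $i$ ranges over $1, \ldots, n$, so that each split time $\tau$ corresponds to exactly one pair $(t, i)$ with $\tau = tn + i$.

```python
def split_index(tau, n):
    """Map a split time tau to (t, i) with tau = t * n + i and 1 <= i <= n."""
    if n < 1 or tau < 1:
        raise ValueError("need n >= 1 and tau >= 1")
    t, r = divmod(tau - 1, n)   # shift by one so that i = n stays in block t
    return t, r + 1

def merge_index(t, i, n):
    """Inverse map: the split time at which controller i acts in block t."""
    if not 1 <= i <= n:
        raise ValueError("controller index must lie in 1..n")
    return t * n + i

if __name__ == "__main__":
    n = 3
    for tau in range(1, 10):
        t, i = split_index(tau, n)
        print(tau, (t, i), merge_index(t, i, n))
```

Under this indexing only controller $i$ acts at split time $tn + i$, which is how the dependence of a controller's observation on the actions of controllers acting earlier in the same original time step is absorbed into the state.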